Abstract

Structured prediction tasks pose a fundamental trade-off between the need for model complexity
to increase predictive power and the limited computational resources for inference in the
exponentially-sized output spaces such models require. We formulate and develop the Structured
Prediction Cascade architecture: a sequence of increasingly complex models that progressively
filter the space of possible outputs. The key principle of our approach is that each model in the
cascade is optimized to accurately filter and refine the structured output state space of the next
model, speeding up both learning and inference in the next layer of the cascade. We learn cascades
by optimizing a novel convex loss function that controls the trade-off between the filtering
efficiency and the accuracy of the cascade, and provide generalization bounds for both accuracy 
and efficiency. We also extend our approach to intractable models using tree-decomposition 
ensembles, and provide algorithms and theory for this setting. We evaluate our approach on 
several large-scale problems, achieving state-of-the-art performance in handwriting recognition 
and human pose recognition. We find that structured prediction cascades allow tremendous 
speedups and the use of previously intractable features and models in both settings. 

1 Introduction 

The classical trade-off between approximation and estimation error (bias/variance) is fundamental
in machine learning. In regression and classification problems, the approximation error can be
reduced by increasing the complexity of the model at the cost of higher estimation error. Standard
statistical model selection techniques (Mallows, 1973; Vapnik and Chervonenkis, 1974; Akaike,
1974; Devroye et al., 1996; Barron et al., 1999; Bartlett et al., 2002) explore a hierarchy of models
of increasing complexity primarily to minimize expected error, without much concern for the
computational cost of using the model at test time.

However, in structured prediction tasks, such as machine translation, speech recognition,
articulated human pose estimation and many other complex prediction problems, test-time computational
constraints play a critical role as models with increasing inference complexity are considered. In
these tasks, there is an exponential number of possible predictions for every input. Breaking these
joint predictions up into independent decisions (e.g., translate each word independently, recognize
a phoneme at a time, detect arms separately) ignores critical correlations and leads to poor
performance. On the other hand, structured models used for these tasks, such as grammars and graphical
models, can capture strong dependencies but at considerable cost of inference. For example, a
first-order conditional random field (CRF) (Lafferty et al., 2001) is fast to evaluate but may not be
an accurate model for phoneme recognition, while a fifth-order model is more accurate, but
prohibitively expensive for both learning and prediction. Model complexity can of course also lead to
over-fitting problems due to the sparseness of the training data, but this aspect of the error is fairly
well understood and controlled using standard regularization and feature selection methods.

In practice, model complexity is limited by computational constraints at prediction time, either
explicitly by the user or implicitly because of the limits of available computation power. We
therefore need to balance inference error with inference efficiency. A common solution is to use heuristic
pruning techniques or approximate search methods in order to make more complex models feasible.
For example, in statistical machine translation, syntactic models are combined with n-gram
language models to produce impractically large inference problems, which are heavily and heuristically
pruned in order to fit into memory and any reasonable time budget (Chiang et al., 2005; Venugopal
et al., 2007; Petrov et al., 2008). However, previous work remains unsatisfactory in several respects:
(1) model parameters are not learned specifically to balance the accuracy/efficiency trade-off, but
instead using remotely related criteria, and (2) no optimality or generalization guarantees exist.

In this paper, we address the accuracy/efficiency trade-off for structured problems by learning a
cascade of structured prediction models, in which the input is passed through a sequence of models 
of increasing computational complexity before a final prediction is produced. The key principle 
of our approach is that each model in the cascade is optimized to accurately filter and refine the 
structured output state space of the next model, speeding up both learning and inference in the next 
layer of the cascade. Although complexity of inference increases (perhaps exponentially) from one 
layer to the next, the state-space sparsity of inference increases exponentially as well, and the entire 
cascade procedure remains highly efficient. We call our approach Structured Prediction Cascades 
(SPC). 

The contributions of this paper are organized as follows:^1

• In Section 3, we describe the SPC inference framework for tree-structured problems where
sparse exact inference is tractable. We also propose a tree-decomposition method for applying
cascades to loopy graphical models in Section 3.2.

• In Section 4, we describe how cascades can be learned to achieve a desired accuracy/efficiency
trade-off on training data. We introduce a novel convex loss function specifically geared for
learning to filter accurately and effectively, and describe a simple stochastic subgradient
algorithm for learning a cascade one layer at a time.

• In Section 5, we provide a theoretical analysis of the accuracy/efficiency trade-off of the
cascade. We develop novel generalization bounds for both accuracy and efficiency of a structured
prediction model.

• In Section 6, we explore in depth two applications of the SPC framework in which the cascades
achieve best-known performance. In Section 6.1, we show how SPC can be applied to linear-chain
models for handwriting recognition. In Section 6.2, we demonstrate the use of SPC for
single-frame human pose estimation using a pictorial structures tree model cascade. Finally,
in Section 6.3, we show how SPC can be applied to estimating pose in video, using the
framework for loopy graphical models introduced in Section 3.2.



^1 Preliminary analysis and applications of structured prediction cascades were developed in
Weiss and Taskar (2010), Sapp et al. (2010b), and Weiss et al. (2010).
2 Related Work



The trade-off between computation time and model complexity, which is the central focus of this 
work, has been studied before in several settings we outline in this section. 



Training-time Computation Trade-Offs. Several recent works have considered the trade-off
between estimation error and computation (number of examples processed) at training time in
large-scale classification using stochastic/online optimization methods, notably (Shalev-Shwartz
and Srebro, 2008; Bottou and Bousquet, 2008). We use such stochastic sub-gradient methods in
our learning procedure (Section 4). In more recent theoretical work, Agarwal et al. (2011) also
address the issue of estimation time by incorporating computational constraints into the classical
empirical risk minimization framework. However, as above, Agarwal et al. (2011) assume that
model selection requires choosing between different methods with fixed test-time computational cost
for all examples. In this paper, we instead analyze adaptive computational trade-offs in structured
inference at test time, and analyze the trade-offs in terms of novel loss functions measuring efficiency
and accuracy.



Test-time Computation Trade-Offs. The issue of controlling computation at test time also
comes up in kernelized classifiers, where prediction speed depends on the number of "support
vectors". Several algorithms, including the Forgetron and Randomized Budget Perceptron (Crammer
et al., 2003; Dekel et al., 2008; Cavallanti et al., 2007), are designed to maintain a limited active
set of support vectors in an online fashion while minimizing error. However, unlike our approach,
these algorithms learn a model that has a fixed running time for each test example. In contrast,
our approach addresses structured prediction problems and has an example-adaptive computational
cost that allows for more computation time on more difficult examples, and greater efficiency gains
on examples where simpler models suffice.



Cascades/Coarse-to-fine reasoning. For binary classification, cascades of classifiers have been
quite successful for reducing computation. Fleuret and Geman (2001) propose a coarse-to-fine
sequence of binary tests to detect the presence and pose of objects in an image. The learned
sequence of tests is trained to minimize expected computational cost. The extremely popular
classifier of Viola and Jones (2001) implements a cascade of boosting ensembles, with earlier stages
using fewer features to quickly reject large portions of the state space. More recent work on binary
classification cascades has focused on further increasing efficiency, e.g., through joint optimization
(Lefakis and Fleuret, 2010) or selecting features at test time (Gao and Koller, 2011). Our cascade
framework is inspired by these binary classification cascades, but poses new objectives, inference,
and learning algorithms to deal with the structured inference setting.



In natural language parsing, several works (Charniak, 2000; Carreras et al., 2008; Petrov, 2009)
use a coarse-to-fine idea closely related to ours and Fleuret and Geman (2001): the marginals of
a simple context-free grammar or dependency model are used to prune the parse chart for a more
complex grammar. We compare to this idea in our experiments. The key difference with our work
is that we explicitly learn a sequence of models tuned specifically to filter the space accurately
and effectively. Unlike the work of Petrov (2009), however, we do not learn the structure of the
hierarchy of models but assume it is given by the designer. Rush and Petrov (2012) apply the
ideas developed in our preliminary work to the problem of dependency parsing in natural language
processing. Rush and Petrov (2012) learn a cascade of simplified parsing models using the objective
presented in Section 4 to achieve state-of-the-art performance in dependency parsing across several
languages in about two orders of magnitude less time.
Felzenszwalb et al. (2010) proposed a cascade for a structured parts-based object detection
model. Their cascade works by early stopping while evaluating individual parts, if the combined
part scores are less than fixed thresholds. While the form of this cascade can be posed in our
more general framework (a cascade of models with an increasing number of parts), we differ from
Felzenszwalb et al. (2010) in that our pruning is based on thresholds that adapt based on inference
in each test example, and we explicitly learn parameters in order to prune safely and efficiently.
In Fleuret and Geman (2001), Viola and Jones (2001), and Felzenszwalb et al. (2010), the focus is on
preserving established levels of accuracy while increasing speed.

3 Structured Prediction Cascades (SPC) 

Given an input space $\mathcal{X}$, output space $\mathcal{Y}$, and a training set $\{(x^1, y^1), \dots, (x^n, y^n)\}$ of $n$ samples
from a joint distribution $D(X, Y)$, the standard supervised learning task is to learn a hypothesis
$h : \mathcal{X} \to \mathcal{Y}$ that minimizes the expected loss $\mathbb{E}_D[\mathcal{L}(h(X), Y)]$ for some non-negative loss function
$\mathcal{L} : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$. In structured prediction problems, $Y$ is an $\ell$-vector of variables,
$\mathcal{Y} = \mathcal{Y}_1 \times \dots \times \mathcal{Y}_\ell$, and $\mathcal{Y}_i = \{1, \dots, K\}$. In many settings, the number of random variables, $\ell$, differs depending on
the input $X$, but for simplicity of notation, we assume a fixed $\ell$ here. Note that for the rest of this
paper, we will use capital letters $X$ and $Y$ to denote random variables drawn from $D(X, Y)$ and
lower-case letters $x$ and $y$ to denote specific values of $X$ and $Y$. We use subscripts to index elements
of $y$, where $y_i$ is the $i$th component of $y$.

The linear hypothesis class we consider is

$$h(x) = \operatorname*{argmax}_{y \in \mathcal{Y}} \theta^\top f(x, y), \qquad (1)$$

where the scoring function $\theta^\top f(x, y)$ is the inner product of a vector of parameters $\theta$ and a feature
function $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$ mapping $(x, y)$ pairs to a set of features. We further make the standard
assumption that $f$ decomposes over a set of cliques $\mathcal{C}$ over output variables, so that

$$\theta^\top f(x, y) = \sum_{c \in \mathcal{C}} \theta^\top f_c(x, y_c). \qquad (2)$$

We use the notation $y_c$ to denote the subset of variables involved in clique $c$, $y_c = \{y_i \mid i \in c\}$.
Similarly, we use $\mathcal{Y}_c = \mathcal{Y}_{i_1} \times \dots \times \mathcal{Y}_{i_{|c|}}$, where $c = \{i_1, \dots, i_{|c|}\}$, to refer to the set of all assignments to
$y_c$. By considering different cliques over $X$ and $Y$, $f$ can represent arbitrary interactions between the
components of $x$ and $y$. Computing the argmax in $h(x)$ is tractable for low-treewidth (hyper)graphs
but is NP-hard in general, and approximate inference is typically used when graphs are not
low-treewidth. We will abbreviate $\theta^\top f(x, y)$ as $\theta(x, y)$ below, and similarly $\theta^\top f_c(x, y_c)$ as $\theta(x, y_c)$.
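As an illustration of Eqs. (1) and (2), the following minimal Python sketch scores outputs of a first-order chain whose cliques are the nodes and the adjacent pairs, and finds the argmax by enumeration. The function names, the `unary` and `pairwise` score tables (standing in for the already-computed clique scores $\theta(x, y_c)$), and the toy values are assumptions made for this example only; brute-force enumeration is used purely because the example is tiny.

```python
from itertools import product

def chain_score(unary, pairwise, y):
    """theta(x, y) for a first-order chain: node clique scores plus edge clique scores."""
    node = sum(unary[i][y[i]] for i in range(len(y)))
    edge = sum(pairwise[y[i]][y[i + 1]] for i in range(len(y) - 1))
    return node + edge

def brute_force_argmax(unary, pairwise):
    """h(x): enumerate all K**ell outputs (only feasible for toy problems)."""
    ell, K = len(unary), len(unary[0])
    return max(product(range(K), repeat=ell),
               key=lambda y: chain_score(unary, pairwise, y))

# Hypothetical clique scores for ell = 3 variables with K = 2 states each.
unary = [[1.0, 0.0], [0.0, 0.5], [0.2, 0.0]]
pairwise = [[0.0, 1.0], [1.0, 0.0]]  # shared across positions; rewards label changes
best = brute_force_argmax(unary, pairwise)
print(best, chain_score(unary, pairwise, best))
```

The enumeration visits $K^\ell$ outputs, which is exactly the cost that the clique decomposition lets dynamic programming avoid.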

In this section, we introduce the framework of Structured Prediction Cascades (SPC) to handle
problems for which the inference problem in Eq. (1) is prohibitively expensive. For example, in a 5th-order
linear chain model for handwriting recognition or part-of-speech tagging, $K$ is about 50
characters or parts of speech, and exact inference is on the order of $50^6 \approx 15$ billion times the length of the
sequence. In tree-structured models we have used for human pose estimation (Sapp et al., 2010b),
typical $K$ for each part includes image location and orientation and is on the order of 250,000, so
computing pairwise features is prohibitive. Rather than learning a single monolithic model, a
structured prediction cascade is a coarse-to-fine sequence of increasingly complex models $\theta^0, \dots, \theta^T$
with corresponding features $f^0, \dots, f^T$. For example, inference complexity scales exponentially with
Markov order in sequence models, and quadratically with spatial/angular resolution in pose models.
The goal of each model is to filter out a large subset of possible values for $y$ without eliminating the
Figure 1: A high-level overview of the SPC inference framework. As the cascade progresses, the
representational power of the models increases, yet tractability is maintained by sufficient filtering of the state
space.

correct one, so that the next level only has to consider a much reduced state-space. The filtering 
process is feed-forward, and each stage runs inference to compute max-marginals which are used to 
eliminate low-scoring node or clique assignments. 

In summary, a high-level overview of the SPC inference framework is as follows. Below, $\mathcal{S}^i$
denotes a sparse (filtered) version of the output space $\mathcal{Y}$:

• Given an input $x$, initialize the cascade with $\mathcal{S}^0 = \mathcal{Y}$.

• Repeat for each level $i = 0, \dots, T - 1$ of the cascade:

  - Run sparse inference over $\mathcal{S}^i$ using model $\theta^i(x, y')$ and eliminate a subset of low-scoring
    outputs.

  - Output $\mathcal{S}^{i+1}$ for the next model.

• Predict using the final level: $\hat{y} = \operatorname*{argmax}_{y' \in \mathcal{S}^T} \theta^T(x, y')$.

The process is illustrated in Figure 1. See Figure 2 for a concrete example of the output of
the first two stages of a cascade for handwriting recognition (Figure 2a) and human pose estimation
(Figure 2b). We will discuss how to represent and choose $\mathcal{S}^i$ in the next section. The key challenge is
that the sets $\mathcal{S}^i$ are exponential in the number of output variables, which rules out explicit representations.
The representation we propose is implicit and concise. It is also tightly integrated with the parameter
estimation algorithm for $\theta^i$ that optimizes the overall accuracy and efficiency of the cascade.

3.1 Cascaded inference with max-marginals 

In order to filter low-scoring outputs, we use max-marginals for reasons that we detail below. For
any value of $y_c$, we define the max-marginal $\theta^\star(x, y_c)$ to be the maximum score of any output $y$
Symbol                                                                      Meaning
$\mathcal{X}$, $X$, $x$                                                     input space, variable and value
$\mathcal{Y}$, $Y$, $y$                                                     output space, variables and value
$\mathcal{C}$, $c$, $y_c$                                                   set of cliques, individual clique, clique assignment
$f(x, y)$                                                                   features of input/output pair
$f_c(x, y_c)$                                                               features of a clique assignment
$\theta(x, y)$                                                              score of input/output pair
$\theta(x, y_c)$                                                            score of a clique assignment
$\theta^\star(x, y_c) = \max_{y' : y'_c = y_c} \theta(x, y')$               max-marginal of a clique assignment $y_c$
$y^\star(x, y_c; \theta) = \operatorname*{argmax}_{y' : y'_c = y_c} \theta(x, y')$   best scoring output consistent with clique assignment $y_c$

Table 1: Summary of key notation.



that is consistent with the assignment $y_c$:

$$\theta^\star(x, y_c) \triangleq \max_{y' \in \mathcal{Y} \,:\, y'_c = y_c} \theta(x, y'). \qquad (3)$$

Max-marginals can be computed exactly and efficiently for any clique $c$ in low-treewidth graphs,
although the computational cost is exponential in $|c|$ (the number of variables in the clique) when
the state space is not filtered. Note that max-marginals can be computed over any clique $c$, not just
the cliques used in the feature function $f$; for example, in Section 6.2 we compute max-marginals
over single variables (i.e., $y_c = y_j$) when performing human pose estimation (Figure 2b), but at
increasingly higher resolutions. On the other hand, in Section 6.1 we compute max-marginals over
increasingly large cliques for sequence models (e.g., bigrams, trigrams, and quadgrams).

Exact computation of max-marginals for a clique $c$ requires the same amount of time to run as
standard exact MAP inference. This process is visualized in Figure 3: once forward and backward
max-sum messages have been computed for MAP inference, the max-marginal for a given value $y_c$
is simply the score $\theta(x, y_c)$ plus the incoming messages to the variables in $c$. Note that
in practice, both stages of computation become faster as the output space becomes increasingly
sparse as the input proceeds through the cascade. This algorithm can also compute the maximizing
assignment for each $y_c$,

$$y^\star(x, y_c; \theta) = \operatorname*{argmax}_{y' \,:\, y'_c = y_c} \theta(x, y'). \qquad (4)$$

We call $y^\star(x, y_c; \theta)$ the argmax-marginal or witness for $y_c$ (it might not be unique, so we break ties
in an arbitrary but deterministic way).
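The message-passing computation described above can be sketched for the simplest case, single-variable cliques ($y_c = y_i$) on a first-order chain. This is a minimal illustration under assumed inputs, not the implementation used in the experiments: `unary` and `pairwise` are hypothetical tables of clique scores, and the brute-force routine is included only to check the dynamic program on small inputs.

```python
from itertools import product

def chain_score(unary, pairwise, y):
    """theta(x, y): sum of node and edge clique scores on a chain."""
    return (sum(unary[i][y[i]] for i in range(len(y)))
            + sum(pairwise[y[i]][y[i + 1]] for i in range(len(y) - 1)))

def node_max_marginals(unary, pairwise):
    """Max-marginal of every (position, state) via forward/backward max-sum messages."""
    ell, K = len(unary), len(unary[0])
    fwd = [[0.0] * K for _ in range(ell)]  # best score of everything left of position i
    bwd = [[0.0] * K for _ in range(ell)]  # best score of everything right of position i
    for i in range(1, ell):
        for k in range(K):
            fwd[i][k] = max(fwd[i - 1][j] + unary[i - 1][j] + pairwise[j][k]
                            for j in range(K))
    for i in range(ell - 2, -1, -1):
        for k in range(K):
            bwd[i][k] = max(bwd[i + 1][j] + unary[i + 1][j] + pairwise[k][j]
                            for j in range(K))
    # Max-marginal = local score plus the two incoming messages.
    return [[fwd[i][k] + unary[i][k] + bwd[i][k] for k in range(K)]
            for i in range(ell)]

def brute_force_max_marginals(unary, pairwise):
    """Direct evaluation of Eq. (3) by enumeration, for checking small cases."""
    ell, K = len(unary), len(unary[0])
    outputs = list(product(range(K), repeat=ell))
    return [[max(chain_score(unary, pairwise, y) for y in outputs if y[i] == k)
             for k in range(K)] for i in range(ell)]
```

The two passes cost on the order of $\ell K^2$ operations, the same as Viterbi decoding, whereas the enumeration costs $K^\ell$.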

Once max-marginals have been computed, we filter the output space by discarding any clique
assignments $y_c$ for which $\theta^\star(x, y_c) < t$ for a threshold $t$ (Figure 4). This filtering rule has two
desirable properties for the cascade that follow immediately from the definition of max-marginals:

Lemma 1 (Safe Filtering). If $\theta(x, y) > t$, then $\forall c\ \theta^\star(x, y_c) > t$.

Lemma 2 (Safe Lattices). If $\max_{y'} \theta(x, y') > t$, then $\exists y\ \forall c\ \theta^\star(x, y_c) > t$.

By Lemma 1, ensuring that the score of the true label $\theta(x, y)$ is greater than the threshold is
sufficient (although not necessary) to guarantee that no marginal assignment $y_c$ consistent with
the true global assignment $y$ will be filtered. This condition will allow us to define a max-marginal
based loss function that we propose to optimize in Section 4 and will analyze in Section 5. Lemma 2,
which follows from Lemma 1, states that so long as the threshold is less than the maximizing
score, there always exists a global assignment $y$ with no pruned cliques (i.e., a valid assignment
Figure 2: Sample output from the first two layers of a cascade. Circles represent output variables, and
the dashed lines indicate cliques that are being filtered at a given level of the cascade, with the attached
tables representing the sparse state space. The solid lines indicate the graph used for inference and features.
(a) Output from a handwriting recognition cascade (Section 6.1) of increasing Markov order. The first level
outputs a sparse set of possible letters for each image. The second level takes as input the sparse set of letters,
and further refines this to a very sparse set of bigrams at each position. (b) Output from a coarse-to-fine
human pose cascade (Section 6.2). The colored areas indicate valid 2D locations for each joint. Unlike the
sequence cascade (a), the cliques stay the same from one layer to another. Instead, the resolution of the
state space doubles with each additional layer.



always exists after pruning). Thus, Lemma 2 guarantees that the filtered output space is never
empty (i.e., $|\mathcal{S}^{i+1}| \geq 1$) in the SPC algorithm
introduced above, and therefore the cascade will always produce a valid output. Note that neither
property generally holds for standard sum-product marginals $p(y_c \mid x)$ of a log-linear CRF (where
$p(y \mid x) \propto e^{\theta(x, y)}$), which motivates our use of max-marginals.

The next component of the inference procedure is choosing a threshold $t$ for a given input $x$
(Figure 4). Note that the threshold cannot be defined as a single global value but should instead
depend strongly on the input $x$ and $\theta(x, \cdot)$, since scores are on different scales for different $x$.
We also have the constraint that computing a threshold function must be fast enough that
sequentially computing scores and thresholds for multiple models in the cascade does not adversely
affect the efficiency of the whole procedure. One might choose a quantile function to consistently
eliminate a desired proportion of the max-marginals for each example. However, quantile functions
are discontinuous functions of $\theta$, and we instead approximate a quantile threshold with a threshold
function that is continuous and convex in $\theta$. We call this the max-mean-max threshold function
(Figure 4), and define it as a convex combination of the maximum score and the mean of the
max-marginals:

$$\tau(x; \theta, \alpha) = \alpha \max_{y} \theta(x, y) + (1 - \alpha) \frac{1}{\sum_{c \in \mathcal{C}} |\mathcal{Y}_c|} \sum_{c \in \mathcal{C}} \sum_{y_c \in \mathcal{Y}_c} \theta^\star(x, y_c). \qquad (5)$$

Choosing a threshold using $\tau$ is therefore equivalent to picking an $\alpha \in [0, 1)$. Note that $\tau(x; \theta, \alpha)$ is
Figure 3: Computing max-marginals over bigrams via message passing. The input is the same as in Figure 2.
Once forward and backward messages have been computed, the max-marginal is simply the sum of incoming
messages and the score of the clique over bigrams.



a convex function of $\theta$ (in fact, piecewise linear), which combined with Lemma 1 will be important
for learning the filtering models and analyzing their generalization. In our experiments, we found
that the distribution of max-marginals was well centered around the mean, so that choosing $\alpha = 0$
resulted in $\approx 50\%$ of max-marginals being eliminated on average. As $\alpha$ approaches 1, the number
of max-marginals eliminated rapidly approaches 100%.^2
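A minimal sketch of the max-mean-max threshold and the resulting filter is given below. It assumes the max-marginals have already been computed and are stored as one list per clique; the function names and toy numbers are hypothetical. It relies on the fact that the largest max-marginal equals the maximum score of any output, and it keeps an assignment when its max-marginal is at least the threshold, matching the discard rule $\theta^\star(x, y_c) < t$ stated earlier.

```python
def max_mean_max_threshold(max_marginals, alpha):
    """alpha * (maximum score) + (1 - alpha) * (mean over all max-marginals)."""
    flat = [m for clique in max_marginals for m in clique]
    best = max(flat)  # the largest max-marginal is max_y theta(x, y)
    mean = sum(flat) / len(flat)
    return alpha * best + (1.0 - alpha) * mean

def filter_assignments(max_marginals, alpha):
    """Indices of surviving assignments per clique (those not below the threshold)."""
    tau = max_mean_max_threshold(max_marginals, alpha)
    return [[k for k, m in enumerate(clique) if m >= tau]
            for clique in max_marginals]

# Hypothetical max-marginals for two cliques with three assignments each.
toy = [[3.0, 1.0, 2.5], [3.0, 2.9, 0.0]]
print(filter_assignments(toy, alpha=0.0))  # threshold at the mean
print(filter_assignments(toy, alpha=0.9))  # threshold close to the maximum
```

Because the threshold never exceeds the maximum score, at least one assignment per clique survives under this rule for any $\alpha$ in $[0, 1)$.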

In summary, the inner loop of the SPC algorithm can be detailed as follows. The sparse output
space is a list of valid assignments $y_c$ for each clique $c$ in the model (e.g., Figure 2a):

$$\mathcal{S}^i = \{\mathcal{Y}_c \mid \forall c \in \mathcal{C}\} \quad \text{(list of valid clique values for all cliques)} \qquad (6)$$

Next, sparse max-sum message passing is used to compute max-marginals $\theta^\star(x, y_c)$ for each
value $y_c \in \mathcal{Y}_c$ of each clique $c$ of interest. Finally, for a given $\alpha$, a threshold is computed and
low-scoring values of $\theta^\star(x, y_c)$ are eliminated. Depending on the model in the next layer of the cascade,
further transformation of the states may be necessary. For example, in the coarse-to-fine pose
cascade (Section 6.2), valid 2-D locations for each limb are halved either vertically or horizontally
to produce finer-resolution states for the next model (Figure 2b).
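The inner loop summarized above can be sketched for node-level filtering on a first-order chain. This is an illustrative simplification under assumed inputs: `valid[i]` is a hypothetical list of surviving states at position `i`, the score tables are toy stand-ins for $\theta(x, y_c)$, and the same scores are reused at both levels, whereas a real cascade would switch to a richer model at each level.

```python
def sparse_node_max_marginals(unary, pairwise, valid):
    """Max-sum message passing restricted to the surviving states valid[i]."""
    ell = len(unary)
    fwd = [dict.fromkeys(valid[i], 0.0) for i in range(ell)]
    bwd = [dict.fromkeys(valid[i], 0.0) for i in range(ell)]
    for i in range(1, ell):
        for k in valid[i]:
            fwd[i][k] = max(fwd[i - 1][j] + unary[i - 1][j] + pairwise[j][k]
                            for j in valid[i - 1])
    for i in range(ell - 2, -1, -1):
        for k in valid[i]:
            bwd[i][k] = max(bwd[i + 1][j] + unary[i + 1][j] + pairwise[k][j]
                            for j in valid[i + 1])
    return [{k: fwd[i][k] + unary[i][k] + bwd[i][k] for k in valid[i]}
            for i in range(ell)]

def filter_level(unary, pairwise, valid, alpha):
    """One cascade level: sparse max-marginals, threshold, then pruning."""
    mm = sparse_node_max_marginals(unary, pairwise, valid)
    flat = [v for position in mm for v in position.values()]
    tau = alpha * max(flat) + (1.0 - alpha) * sum(flat) / len(flat)
    return [[k for k in valid[i] if mm[i][k] >= tau] for i in range(len(unary))]

# Hypothetical scores: 3 positions, 3 states, no pairwise interaction.
unary = [[2.0, 0.0, 1.0], [0.0, 1.5, 1.0], [1.0, 0.0, 2.0]]
pairwise = [[0.0] * 3 for _ in range(3)]
level0 = filter_level(unary, pairwise, [[0, 1, 2]] * 3, alpha=0.0)
level1 = filter_level(unary, pairwise, level0, alpha=0.5)
print(level0, level1)
```

Each level only iterates over surviving states, so the cost of message passing shrinks as the state space becomes sparser.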

3.2 Cascaded Inference in Loopy Graphs 

Thus far, we have assumed that (sparse) inference is feasible, so that max-marginals can be
computed. In this section, we describe how to apply SPC when exact max-sum message passing is
computationally infeasible due to loops in the graph structure of the model. In order to simplify
the presentation in this section, we will assume that the structured cascade under consideration
operates in a "node-centric" coarse-to-fine manner as follows: For each variable $y_j$ in the model, each
level of the cascade filters a current set of possible states $\mathcal{Y}_j$, and any surviving states are passed
forward to the next level of the cascade by substituting each state with its set of descendants in
a hierarchy. For example, such hierarchies arise in pose estimation (Section 6.2) by discretizing
the articulation of joints at multiple resolutions, or in image segmentation due to the semantic
relationship between class labels (e.g., "grass" and "tree" can be grouped as "plants," "horse" and
"cow" can be grouped as "animal"). Thus, in the pose estimation problem, surviving states are



^2 We use cross-validation to determine the optimal $\alpha$ in our experiments (see Section 6).
Figure 4: Thresholding bigrams using max-marginals. The input is the same as in Figure 2. The sparse
set of unfiltered bigrams is shown at each position according to the max-marginal score. The bigrams
corresponding to the correct label sequence, brace, are highlighted in green. The green dashed line indicates
the score of the correct label sequence. Note that the max-marginals of the correct sequence are at least
the score of the correct sequence. The black dashed line indicates the maximum score of any sequence,
which is the maximum filtering threshold. The largest max-marginal values are all exactly equal to this
score. The red dashed lines indicate two candidate filtering thresholds $\tau(x; \theta, \alpha)$, the more aggressive
of which uses $\alpha = 0.5$, and the corresponding sets of filtered bigrams are highlighted. Note that a filtering
error occurs at the more aggressive level of $\alpha = 0.5$.



subdivided into multiple finer-resolution states; in the image segmentation problem, broader object
classes are split into their constituent classes for the next level.

The key idea of this section is that we decompose the loopy model into a collection of equivalent
sub-models for which inference is tractable. What distinguishes this approach from other
decomposition-based methods (e.g., Komodakis et al. (2007); Bertsekas (1999)) is that, because
the cascade's objective is filtering and not decoding, our approach does not require enforcing the
constraint that the sub-models agree on which output has maximum score. In preliminary work
(Weiss et al., 2010), this approach was called structured ensemble cascades; here we simply refer to
it as Ensemble-SPC.

Given a loopy (intractable) graphical model, it is always possible to express the score of a given
output $\theta(x, y)$ as the sum of $P$ scores $\theta_p(x, y)$ under sub-models that collectively cover every edge
in the loopy model: $\theta(x, y) = \sum_p \theta_p(x, y)$ (Figure 5). However, it is not the case that optimizing
each individual sub-model separately will yield the single globally optimal solution. Instead, care
must be taken to enforce agreement between sub-models. For example, in the method of dual
decomposition (Komodakis et al., 2007), it is possible to solve a relaxed MAP problem in the
(intractable) full model by running inference in the (tractable) sub-models under the constraint
that all sub-models agree on the argmax solution. Enforcing this constraint requires iteratively
Figure 5: Example decomposition of a 3 × 3 fully connected grid into all six constituent "comb"
trees. In general, an n × n grid yields 2n such trees.

re-weighting unary potentials of the sub-models and repeatedly re-running inference until each
sub-model converges to the same argmax solution.

However, for the purposes of SPC, we are only interested in computing the max-marginals
$\theta^\star(x, y_j)$. In other words, we are only interested in knowing whether or not a configuration $y$
consistent with $y_j$ that scores highly in each sub-model $\theta_p(x, y)$ exists. We show in the remainder
of this section that the requirement that a single $y$ consistent with $y_j$ optimizes the score of each
sub-model (i.e., that all sub-models agree) is not necessary for the purposes of filtering. Thus, because
we do not have to enforce agreement between sub-models, we can apply SPC to intractable (loopy)
models, but pay only a linear (factor of $P$) increase in inference time over the tractable sub-models.

Formally, we define a single level of the Ensemble-SPC as a set of $P$ models such that $\theta(x, y) =
\sum_p \theta_p(x, y)$. We let $\theta_p^\star(x, y_c)$, $\theta_p^\star(x)$, and $\tau(x; \theta_p, \alpha)$ denote the max-marginals, max score, and
threshold of the $p$'th model, respectively. Recall that the argmax-marginal or witness $y^\star(x, y_j; \theta_p)$ is
defined as the maximizing complete assignment of the corresponding max-marginal $\theta_p^\star(x, y_j)$. Then
we have that

$$\theta^\star(x, y_j) = \sum_p \theta_p^\star(x, y_j) \quad \text{(with agreement: } y = y^\star(x, y_j; \theta_p),\ \forall p\text{)} \qquad (7)$$

$$\theta^\star(x, y_j) \leq \sum_p \theta_p^\star(x, y_j) \quad \text{(in general)} \qquad (8)$$

Note that if we do not require the sub-models to agree, then $\theta^\star(x, y_j)$ is strictly less than $\sum_p \theta_p^\star(x, y_j)$.
Nonetheless, as we show next, the approximation $\theta^\star(x, y_j) \approx \sum_p \theta_p^\star(x, y_j)$ is still useful and sufficient
for filtering in a structured cascade.

We now show that if a given label $y$ has a high score in the full model, it must also have a large
ensemble max-marginal score, even if the sub-models do not agree on the argmax. This extends
Lemma 1 to the ensemble case, as follows:

Lemma 3 (Joint Safe Filtering). If $\sum_p \theta_p(x, y) > t$, then $\sum_p \theta_p^\star(x, y_j) > t$ for all $j$.

Proof. In English, this lemma states that if the global score is above a given threshold, then the
sum of sub-model max-marginals is also above threshold (with no agreement constraint). The
proof is straightforward. For any $y_j$ consistent with $y$, we have $\theta_p^\star(x, y_j) \geq \theta_p(x, y)$. Therefore

$$\sum_p \theta_p^\star(x, y_j) \geq \sum_p \theta_p(x, y) > t.$$

Therefore, we see that an agreement constraint is not necessary in order to filter safely: if we
ensure that the combined score $\sum_p \theta_p(x, y)$ of the true label $y$ is above threshold, then we can filter
without making a mistake if we compute max-marginals by running inference separately for each
sub-model. However, there is still potentially a price to pay for disagreement. If the sub-models
do not agree, and the truth is not above threshold, then the threshold may filter all of the states
for a given variable $y_j$ and therefore "break" the cascade. This results from the fact that without
agreement, there is no single argmax output $y^\star$ that is always above threshold for any $\alpha$; therefore,
we do not have an equivalent to Lemma 2 for the ensemble case. However, we note that in our
experiments (Section 6.3), we never experienced such breakdown of the cascades.
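The relationship between the full-model max-marginals and the ensemble sum, and the safe-filtering property of Lemma 3, can be checked numerically on a toy loopy model. The sketch below is an assumption-laden illustration: a triangle over three binary variables with hypothetical edge weights is split into a two-edge chain and a single-edge sub-model, and all max-marginals are computed by enumeration rather than message passing.

```python
from itertools import product

# Hypothetical edge weights for a triangle graph (a loopy model).
EDGE_WEIGHT = {(0, 1): 1.0, (1, 2): -2.0, (0, 2): 1.5}

def edge_score(y, edge):
    """An edge contributes its weight when its two endpoints take the same label."""
    a, b = edge
    return EDGE_WEIGHT[edge] if y[a] == y[b] else 0.0

def sub_model_1(y):  # tree sub-model: chain 0 - 1 - 2
    return edge_score(y, (0, 1)) + edge_score(y, (1, 2))

def sub_model_2(y):  # tree sub-model: the remaining edge 0 - 2
    return edge_score(y, (0, 2))

def full_model(y):   # the sub-models jointly cover every edge
    return sub_model_1(y) + sub_model_2(y)

OUTPUTS = list(product((0, 1), repeat=3))

def max_marginal(score, j, k):
    """Best score of any output with y_j = k, by enumeration."""
    return max(score(y) for y in OUTPUTS if y[j] == k)

def ensemble_max_marginal(j, k):
    """Sum of sub-model max-marginals, computed with no agreement constraint."""
    return max_marginal(sub_model_1, j, k) + max_marginal(sub_model_2, j, k)

print(max_marginal(full_model, 0, 0), ensemble_max_marginal(0, 0))
```

In this toy case the sub-model witnesses for $y_0 = 0$ disagree, so the ensemble value exceeds the full-model max-marginal, yet any output scoring above a threshold in the full model still has all of its ensemble max-marginals above that threshold.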

4 Learning Structured Prediction Cascades 

When learning a cascade, we have two competing objectives that we must trade off: 

• Accuracy: Minimize the number of errors incurred by each level of the cascade to ensure an 
accurate inference process in subsequent models. 

• Efficiency: Maximize the number of filtered max-marginals at each level in the cascade to 
ensure an efficient inference process in subsequent models. 

Given a training set, we can measure the accuracy and efficiency of our cascade, but what is
unknown is the performance of the cascade on test data. In Section 5, we provide a guarantee that
our estimates of accuracy and efficiency will be reasonably close to the true performance measures 
with high probability. This suggests that optimizing parameters to achieve a desired trade-off on 
training data is a good idea. 

We begin by quantifying accuracy and efficiency in terms of max-marginals, as used by SPC. We
define the filtering loss $\mathcal{L}_f$ to be a 0-1 loss indicating a mistakenly eliminated correct assignment. As
discussed in the previous section, Lemma 1 states that an error can only occur if $\theta(x, y) < \tau(x; \theta, \alpha)$.
We also define the efficiency loss $\mathcal{L}_e$ to simply be the proportion of unfiltered clique assignments.

Definition 1 (Filtering loss). A filtering error occurs when a max-marginal of a clique assignment
of the correct output $y$ is pruned. We define filtering loss as

$$\mathcal{L}_f(x, y; \theta, \alpha) = \mathbf{1}\left[\theta(x, y) \le \tau(x; \theta, \alpha)\right]. \tag{9}$$

Definition 2 (Efficiency loss). The efficiency loss is the proportion of unpruned clique assignments:

$$\mathcal{L}_e(x, y; \theta, \alpha) = \frac{1}{\sum_{c \in \mathcal{C}} |\mathcal{Y}_c|} \sum_{c \in \mathcal{C},\, y_c \in \mathcal{Y}_c} \mathbf{1}\left[\theta^\star(x, y_c) > \tau(x; \theta, \alpha)\right]. \tag{10}$$

We now turn to the problem of learning parameters $\theta$ and tuning the threshold parameter $\alpha$
from training data. We have two competing objectives, accuracy ($\mathcal{L}_f$) and efficiency ($\mathcal{L}_e$), that we
must trade off. Note that we can trivially minimize either of these at the expense of maximizing the
other. If we set $(\theta, \alpha)$ to achieve a minimal threshold such that no assignments are ever filtered, then
$\mathcal{L}_f = 0$ and $\mathcal{L}_e = 1$. Alternatively, if we choose a threshold to filter every assignment, then $\mathcal{L}_f = 1$
while $\mathcal{L}_e = 0$. To learn a cascade of practical value, we can minimize one loss while constraining
the other below a fixed level $\epsilon$. Since the ultimate goal of the cascade is accurate classification, we
focus on the problem of minimizing efficiency loss while constraining the filtering loss to be below
a desired tolerance.

Algorithm 1 Forward Batch Learning of Structured Prediction Cascades.

Input: Data $\{(x^i, y^i)\}_{i=1}^{n}$, structured feature generators $f^0, \ldots, f^T$, and parameters $\alpha^0, \ldots, \alpha^{T-1}$.

Output: Cascade parameters $\theta^0, \ldots, \theta^T$.

Initialize $\mathcal{S}^0(x^i) = \mathcal{Y}(x^i)$ for each example.

for $t = 0$ to $T - 1$ do

• Optimize (12) with sparse inference over the valid set $\mathcal{S}^t(x^i)$ to find $\theta^t$.

• Generate $\mathcal{S}^{t+1}(x^i)$ from $\mathcal{S}^t(x^i)$ by filtering low-scoring clique assignments $y_c$ where

$\theta^{t\star}(x^i, y_c) \le \tau(x^i; \theta^t, \alpha^t)$

end for

• Learn $\theta^T$ using structured predictor over sparse state spaces $\mathcal{S}^T(x^i)$.
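The two losses can be made concrete with a small numerical sketch. The Python below is illustrative only: it assumes the max-marginals of all clique assignments have already been computed and are passed in as a flat list, and it assumes the threshold is a convex combination of the best score and the mean max-marginal, weighted by alpha. The function names and the toy scores are hypothetical.

```python
from typing import Sequence


def threshold(max_marginals: Sequence[float], alpha: float) -> float:
    """Assumed threshold: alpha * best score + (1 - alpha) * mean max-marginal."""
    best = max(max_marginals)  # the largest max-marginal equals the best overall score
    mean = sum(max_marginals) / len(max_marginals)
    return alpha * best + (1.0 - alpha) * mean


def filtering_loss(truth_score: float, tau: float) -> int:
    """Definition 1: 1 if the true output scores at or below the threshold, else 0."""
    return int(truth_score <= tau)


def efficiency_loss(max_marginals: Sequence[float], tau: float) -> float:
    """Definition 2: fraction of clique assignments that survive filtering."""
    kept = sum(1 for m in max_marginals if m > tau)
    return kept / len(max_marginals)


# Toy example: six clique assignments with made-up max-marginal scores.
mm = [4.0, 1.0, 3.0, 0.5, 2.5, 1.0]
tau = threshold(mm, alpha=0.0)  # with alpha = 0 the threshold is the mean max-marginal
print(tau, filtering_loss(3.0, tau), efficiency_loss(mm, tau))
```

Raising alpha moves the threshold toward the best score, which lowers the efficiency loss but makes a filtering error more likely; this is the trade-off the learning objective controls.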

We express the cascade learning objective for a single level of the cascade as a joint optimization
over $\theta$ and $\alpha$:

$$\min_{\theta, \alpha} \; \mathbb{E}_{X,Y}\left[\mathcal{L}_e(X, Y; \theta, \alpha)\right] \quad \text{s.t.} \quad \mathbb{E}_{X,Y}\left[\mathcal{L}_f(X, Y; \theta, \alpha)\right] \le \epsilon. \tag{11}$$

We solve this problem for a single level of the cascade as follows. First, we define a convex
upper bound (12) on the filter error $\mathcal{L}_f$, making the problem of minimizing $\mathcal{L}_f$ convex in $\theta$ (given
$\alpha$). We learn $\theta$ to minimize filter error for several settings of $\alpha$ (thus controlling filtering efficiency).
Given several possible values for $\theta$, we optimize the objective (11) over $\alpha$ directly, using estimates
of $\mathcal{L}_f$ and $\mathcal{L}_e$ computed on a held-out development set, and choose the best $\theta$. Note that in Section
5 we present a theorem bounding the deviation of our estimates of the efficiency and filtering loss
from the expectation of these losses.

For the first step of learning a single level of the cascade, we learn the parameters $\theta$ for a fixed
$\alpha$ using the following convex margin optimization problem:

$$\text{SPC}: \quad \min_{\theta} \; \frac{\lambda}{2}\|\theta\|^2 + \frac{1}{n} \sum_i H(x^i, y^i; \theta, \alpha), \tag{12}$$

where $H$ is a convex upper bound on the filter loss $\mathcal{L}_f$,

$$H(x^i, y^i; \theta, \alpha) = \max\{0,\; \ell^i + \tau(x^i; \theta, \alpha) - \theta(x^i, y^i)\}.$$

The upper bound $H$ is a hinge loss measuring the margin between the filter threshold $\tau(x^i; \theta, \alpha)$ and
the score of the truth $\theta^\top f(x^i, y^i)$; the loss is zero if the truth scores above the threshold by margin
$\ell^i$ (in practice, the length $\ell^i$ can vary by example). We solve (12) using stochastic sub-gradient
descent. Given a sample $(x, y)$, we apply the following update if $H(\theta, x, y)$ (i.e., the sub-gradient)
is non-zero:

$$\theta' \leftarrow (1 - \eta\lambda)\theta + \eta f(x, y) - \eta\alpha f(x, y^\star) - \eta(1 - \alpha) \frac{1}{\sum_{c} |\mathcal{Y}_c|} \sum_{c \in \mathcal{C},\, y_c} f(x, y^\star(x, y_c; \theta)). \tag{13}$$

Above, $\eta$ is a learning rate parameter. The key distinguishing feature of this update compared
to the structured perceptron update is that it subtracts features included in all max-marginal
assignments $y^\star(x, y_c; \theta)$.
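A single step of this update can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' implementation: the feature vectors of the truth, of the argmax output, and the average feature vector over all max-marginal witnesses are assumed to be computed elsewhere and passed in as plain lists, and the names `hinge_loss` and `spc_update` are hypothetical.

```python
from typing import List

Vector = List[float]


def hinge_loss(truth_score: float, tau: float, margin: float) -> float:
    """Hinge upper bound H = max(0, margin + tau - truth_score)."""
    return max(0.0, margin + tau - truth_score)


def spc_update(theta: Vector, f_truth: Vector, f_best: Vector, f_mm_mean: Vector,
               hinge: float, alpha: float, eta: float, lam: float) -> Vector:
    """One sub-gradient step in the spirit of update (13).

    f_truth   : features of the true output y
    f_best    : features of the highest-scoring output y*
    f_mm_mean : features averaged over the witnesses of all max-marginals
    """
    if hinge <= 0.0:
        # The text applies the update only when the hinge is active.
        return list(theta)
    return [
        (1.0 - eta * lam) * t + eta * ft - eta * alpha * fb - eta * (1.0 - alpha) * fm
        for t, ft, fb, fm in zip(theta, f_truth, f_best, f_mm_mean)
    ]


# Toy two-dimensional example with made-up feature vectors.
new_theta = spc_update([0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5],
                       hinge=1.0, alpha=0.5, eta=0.1, lam=0.0)
print(new_theta)
```

The step moves weight toward the features of the truth and away from both the argmax output and the average max-marginal witness, which is what lowers the threshold relative to the true score.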



Note that because (12) is $\lambda$-strongly convex, we can choose $\eta_t = 1/(\lambda t)$ and add a projection
step to keep $\theta$ in a fixed norm-ball. The update then corresponds to the Pegasos update with
convergence guarantees of $O(1/\epsilon)$ iterations for $\epsilon$-accurate solutions (Shalev-Shwartz et al.,
2007).



An overview of the entire learning process for the whole cascade is given in Algorithm 1. Levels
of the cascade are learned incrementally using the output of the previous level of the cascade as
input. Note that Algorithm 1 trades memory efficiency for time efficiency by storing the sparse
data structures for each example. A more memory-efficient (but less time-efficient) algorithm
would instead run all previous layers of the cascade for each example during sub-gradient descent
optimization of (12).



Finally, in our implementation, we can sometimes achieve better results by further tuning the
threshold parameters using a development set. We first learn $\theta$ using some fixed $\alpha$ as before.
However, we then choose an improved $\alpha'$ by maximizing efficiency subject to the constraint that
filter loss on the development set is less than a tolerance $\epsilon$:

$$\alpha' \leftarrow \operatorname*{argmin}_{\alpha} \sum_{i=1}^{n} \mathcal{L}_e(x^i, y^i; \theta, \alpha) \quad \text{s.t.} \quad \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}_f(x^i, y^i; \theta, \alpha) \le \epsilon.$$

Furthermore, we can repeat this tuning process for several different starting values of $\alpha$ and pick
the $(\theta, \alpha')$ pair with the optimal trade-off, to further improve performance. In practice, we find
that this procedure can substantially improve the efficiency of the cascade while keeping accuracy
within range of the given tolerance.
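The threshold tuning step amounts to a one-dimensional search. The Python sketch below is an illustration under stated assumptions: the development set is represented as a list of (truth score, max-marginal list) pairs computed with the learned parameters, the threshold is taken to be the convex combination of best score and mean max-marginal, and a simple grid over candidate alpha values stands in for whatever search the authors used. The name `tune_alpha` is hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Example = Tuple[float, List[float]]  # (score of the truth, max-marginals of all assignments)


def tune_alpha(examples: Sequence[Example], alphas: Sequence[float],
               epsilon: float) -> Optional[Tuple[float, float, float]]:
    """Return (alpha, efficiency loss, filter loss) with the lowest efficiency loss
    among candidates whose filter loss is at most epsilon, or None if none qualifies."""
    best = None
    for alpha in alphas:
        f_loss = 0.0
        e_loss = 0.0
        for truth_score, mm in examples:
            tau = alpha * max(mm) + (1.0 - alpha) * sum(mm) / len(mm)
            f_loss += truth_score <= tau                     # filtering error on this example
            e_loss += sum(m > tau for m in mm) / len(mm)     # fraction of surviving assignments
        f_loss /= len(examples)
        e_loss /= len(examples)
        if f_loss <= epsilon and (best is None or e_loss < best[1]):
            best = (alpha, e_loss, f_loss)
    return best


# Toy development set with made-up scores.
dev = [(3.0, [4.0, 1.0, 3.0, 0.5, 2.5, 1.0]), (5.0, [5.0, 1.0, 0.0, 2.0])]
print(tune_alpha(dev, alphas=[0.0, 0.5, 0.9], epsilon=0.0))
```

With a tolerance of zero only the most conservative candidate is feasible on this toy data; loosening the tolerance admits a more aggressive threshold with a lower efficiency loss.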

It is straightforward to adapt Algorithm 1 for the Ensemble-SPC case. As in the previous
section, we first define the natural loss function for sums of max-marginals, as suggested by Lemma
3. We define the joint filtering loss as follows.

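Definition 3 (Joint Filtering Loss).

$$\mathcal{L}_{joint}(x, y; \theta, \alpha) = \mathbf{1}\left[\sum_p \theta_p(x, y) \le \sum_p \tau(x; \theta_p, \alpha)\right]. \tag{14}$$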



We now discuss how to minimize the joint filter loss (14) given a dataset. We rephrase the SPC
optimization problem (12) using the ensemble max-marginals to form the ensemble cascade margin
problem:

$$\min_{\theta_1, \ldots, \theta_P,\; \xi \ge 0} \; \frac{\lambda}{2} \sum_p \|\theta_p\|^2 + \frac{1}{n} \sum_i \xi^i \quad \text{s.t.} \quad \sum_p \theta_p(x^i, y^i) \ge \sum_p \tau(x^i; \theta_p, \alpha) + \ell^i - \xi^i. \tag{15}$$

Seeing that the constraints can be reordered to show
$\xi^i \ge \sum_p \tau(x^i; \theta_p, \alpha) - \sum_p \theta_p(x^i, y^i) + \ell^i$, we can
form an equivalent unconstrained minimization problem:

$$\min_{\theta_1, \ldots, \theta_P} \; \frac{\lambda}{2} \sum_p \|\theta_p\|^2 + \frac{1}{n} \sum_i \left[\sum_p \tau(x^i; \theta_p, \alpha) - \sum_p \theta_p(x^i, y^i) + \ell^i\right]_+, \tag{16}$$

where $[z]_+ = \max\{z, 0\}$. Finally, we take the subgradient of the objective in (16) with respect to
each parameter $\theta_p$. This yields the following update rule for the $p$'th model:

$$\theta_p \leftarrow (1 - \lambda)\theta_p + \begin{cases} 0 & \text{if } \sum_p \theta_p(x^i, y^i) \ge \sum_p \tau(x^i; \theta_p, \alpha) + \ell^i, \\ \nabla\theta_p(x^i, y^i) - \nabla\tau(x^i; \theta_p, \alpha) & \text{otherwise.} \end{cases} \tag{17}$$

This update is identical to the original SPC update with the exception that we update each model
individually only when the ensemble has made a mistake jointly. Thus, learning to filter with
the ensemble requires only $P$ times as many resources as learning to filter with any of the models
individually. We simply replace the optimization over (12) step in Algorithm 1 with an optimization
over (16).
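Filtering with an ensemble compares summed max-marginals against summed thresholds. The following Python sketch illustrates that comparison for the states of a single variable. It assumes each sub-model's max-marginals are already available as a list indexed by state, and it assumes each sub-model's threshold is the convex combination of its best score and mean max-marginal (here computed over that variable's states only, as a simplification). The name `ensemble_filter` is hypothetical.

```python
from typing import List, Sequence


def ensemble_filter(max_marginals_per_model: Sequence[Sequence[float]],
                    alpha: float) -> List[int]:
    """Return the indices of states whose summed max-marginal exceeds the summed threshold.

    max_marginals_per_model[p][s] is the max-marginal of state s under sub-model p.
    """
    n_states = len(max_marginals_per_model[0])
    # Sum the max-marginals of each state across sub-models.
    summed = [sum(mm[s] for mm in max_marginals_per_model) for s in range(n_states)]
    # Sum the per-model thresholds (assumed form: alpha * max + (1 - alpha) * mean).
    tau = sum(alpha * max(mm) + (1.0 - alpha) * sum(mm) / len(mm)
              for mm in max_marginals_per_model)
    return [s for s in range(n_states) if summed[s] > tau]


# Two sub-models that disagree about the best of three states (made-up scores).
models = [[3.0, 1.0, 2.0], [0.0, 4.0, 2.0]]
print(ensemble_filter(models, alpha=0.0))
print(ensemble_filter(models, alpha=1.0))
```

With alpha equal to 1 and disagreeing sub-models, no state reaches the summed threshold and the returned list is empty, which is the "broken" cascade scenario discussed for the ensemble case.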



5 Generalization Analysis 

We now present generalization bounds on the filtering and efficiency loss functions for a single level
of a cascade. To achieve bounds on the entire cascade, these can be combined provided that a fresh
sample is used for each level. To prove the following bounds, we make use of Gaussian complexity
results from Bartlett and Mendelson (2002), which requires vectorizing scoring and loss functions
in a novel structured manner (details in Appendix A). The main theorem in this section depends on
Lipschitz dominating cost functions $\mathcal{L}_f^\gamma$ and $\mathcal{L}_e^\gamma$ that upper bound $\mathcal{L}_f$ and $\mathcal{L}_e$. Note that as $\gamma \to 0$,
we recover $\mathcal{L}_f$ and $\mathcal{L}_e$.

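Definition 4 (Margin-augmented losses). We define margin-augmented filtering and efficiency
losses using the usual $\gamma$-margin function:

$$r_\gamma(z) = \begin{cases} 1 & \text{if } z < 0, \\ 1 - z/\gamma & \text{if } 0 \le z \le \gamma, \\ 0 & \text{if } z > \gamma, \end{cases} \tag{18}$$

$$\mathcal{L}_f^\gamma(x, y; \theta, \alpha) = r_\gamma\left(\theta(x, y) - \tau(x; \theta, \alpha)\right), \tag{19}$$

$$\mathcal{L}_e^\gamma(x, y; \theta, \alpha) = \frac{1}{\sum_{c \in \mathcal{C}} |\mathcal{Y}_c|} \sum_{c \in \mathcal{C},\, y_c} r_\gamma\left(\tau(x; \theta, \alpha) - \theta^\star(x, y_c)\right). \tag{20}$$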
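The margin function is a piecewise-linear ramp, and the margin-augmented losses simply apply it to the gap between a score and the threshold. The short Python sketch below illustrates this, assuming scores and thresholds are given as plain numbers; the function names are hypothetical.

```python
from typing import Sequence


def margin_loss(z: float, gamma: float) -> float:
    """The gamma-margin ramp: 1 for z < 0, linear on [0, gamma], 0 above gamma."""
    if z < 0.0:
        return 1.0
    if z <= gamma:
        return 1.0 - z / gamma
    return 0.0


def margin_filtering_loss(truth_score: float, tau: float, gamma: float) -> float:
    """Margin-augmented filtering loss: ramp applied to (truth score - threshold)."""
    return margin_loss(truth_score - tau, gamma)


def margin_efficiency_loss(max_marginals: Sequence[float], tau: float, gamma: float) -> float:
    """Margin-augmented efficiency loss: mean ramp applied to (threshold - max-marginal)."""
    return sum(margin_loss(tau - m, gamma) for m in max_marginals) / len(max_marginals)


print(margin_loss(0.25, 0.5), margin_filtering_loss(3.0, 2.0, 0.5))
```

Because the ramp equals 1 whenever its argument is at most zero and is never negative, each margin-augmented loss is at least as large as the corresponding 0-1 loss for any positive gamma.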



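Theorem 1. Fix $\alpha \in [0, 1]$ and let $\Theta$ be the class of all scoring functions $\theta$ with $\|\theta\|_2 \le B$,
let $|\mathcal{C}|$ be the total number of cliques, let $m = \sum_{c \in \mathcal{C}} |\mathcal{Y}_c|$ be the total number of clique assignments,
and let $\|f_c(x, y_c)\|_2 \le 1$ for all $x \in \mathcal{X}$, $c \in \mathcal{C}$ and $y_c \in \mathcal{Y}_c$. Then there exists a constant $c$ such that for any
integer $n$ and any $0 < \delta < 1$, with probability $1 - \delta$ over samples of size $n$, every $\theta \in \Theta$ satisfies:

$$\mathbb{E}\left[\mathcal{L}_f(X, Y; \theta, \alpha)\right] \le \hat{\mathbb{E}}\left[\mathcal{L}_f^\gamma(X, Y; \theta, \alpha)\right] + \frac{c\, m\, B \sqrt{|\mathcal{C}|}}{\gamma \sqrt{n}} + \sqrt{\frac{8 \ln(2/\delta)}{n}}, \tag{21}$$

$$\mathbb{E}\left[\mathcal{L}_e(X, Y; \theta, \alpha)\right] \le \hat{\mathbb{E}}\left[\mathcal{L}_e^\gamma(X, Y; \theta, \alpha)\right] + \frac{c\, m\, B \sqrt{|\mathcal{C}|}}{\gamma \sqrt{n}} + \sqrt{\frac{8 \ln(2/\delta)}{n}}, \tag{22}$$

where $\hat{\mathbb{E}}$ is the empirical expectation with respect to the sample.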

Theorem 1 provides theoretical justification for the definitions of the loss functions $\mathcal{L}_e$ and $\mathcal{L}_f$
and the structured cascade objective; if we observe a highly accurate and efficient filtering model
$(\theta, \alpha)$ on a finite sample of training data, it is likely that the performance of the model on unseen
test data will not be too much worse as $n$ gets large. Theorem 1 is the first theoretical guarantee
on the generalization of accuracy and efficiency of a structured filtering model.

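We now turn to the ensemble setting and define an appropriate margin-augmented loss:

Definition 5 (Ensemble margin-augmented loss).

$$\mathcal{L}_{joint}^\gamma(x, y; \theta, \alpha) = r_\gamma\left(\sum_p \theta_p(x, y) - \sum_p \tau(x; \theta_p, \alpha)\right). \tag{23}$$

Theorem 2. Fix $\alpha \in [0, 1]$ and let $\|\theta_p\|_2 \le B/P$ for all $p$, and $\|f_c(x, y_c)\|_2 \le 1$ for all $x$ and $y_c$.
Then there exists a constant $c$ such that for any integer $n$ and any $0 < \delta < 1$, with probability $1 - \delta$
over samples of size $n$, every $\theta = \{\theta_1, \ldots, \theta_P\}$ satisfies:

$$\mathbb{E}\left[\mathcal{L}_{joint}(X, Y; \theta, \alpha)\right] \le \hat{\mathbb{E}}\left[\mathcal{L}_{joint}^\gamma(X, Y; \theta, \alpha)\right] + \frac{c\, m\, B\, P \sqrt{|\mathcal{C}|}}{\gamma \sqrt{n}} + \sqrt{\frac{8 \ln(2/\delta)}{n}}, \tag{24}$$

where $\hat{\mathbb{E}}$ is the empirical expectation with respect to the sample.
The proof of Theorem 2 is given in Appendix A.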



6 Applications 

In this section, we explore in detail several evaluations of structured prediction cascades. In Section
6.1, we describe a structured prediction cascade for sequential data and apply this model to handwriting
recognition. In Section 6.2, we describe SPC for articulated human pose estimation from single
frame images. Finally, in Section 6.3, we evaluate Ensemble-SPC on a synthetic image segmentation
task and the problem of detecting and tracking articulated poses in video.



6.1 Linear-chain Cascade 

In this section, we apply the structured prediction cascades framework to sequence prediction
tasks with increasingly high order linear-chain models. The state space of a linear-chain model
is $\mathcal{Y}_i := \{1, \ldots, K\}$, where $K$ is the number of possible states. Thus, the size of the state
space is $K$. A $d$-order linear-chain model has maximal cliques $\{x, y_i, y_{i-1}, \ldots, y_{i-d}\}$. Thus, for an
order $d$ clique, there are $K^d$ possible clique assignments, although we find that in practice very few
high-order clique assignments survive the first few levels of the cascade (see Table 2).

For a $d$-order linear-chain model, the score of an output $y$ is given by a combination of unary
features and transition features,

$$\theta(x, y) = \sum_{i=1}^{\ell} \theta_0^\top f_0(x, y_i) + \sum_{i=1}^{\ell} \sum_{j=1}^{d} \theta_j^\top f_j(y_i, \ldots, y_{i-j}), \tag{25}$$

where $\theta_0$ is a set of parameters for unary features $f_0(x, y_i)$ that depend on a single output variable
and $\theta_j$ is a set of parameters scoring $j$-order transition features $f_j(y_i, \ldots, y_{i-j})$.

In general, any $d$-order linear-chain model can be equivalently represented as a bigram (2-order)
model with $K^{d-1}$ states. Thus, it is simplest to implement a cascade of sequence models of
increasing order as a set of bigram models where the state space is increasing exponentially by a
factor of $K$ from one model to the next. Given a list of valid assignments $\mathcal{S}^d$ in a $d$-order model, we
can generate an expanded list of valid assignments $\mathcal{S}^{d+1}$ for a $(d+1)$-order model by concatenating
the valid $d$-grams with all possible additional states.
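The expansion step is a simple Cartesian extension of the surviving n-grams. The Python sketch below illustrates it, assuming n-grams are stored as tuples of integer state indices (0 to K-1 here, rather than 1 to K); the name `expand_ngrams` is hypothetical.

```python
from typing import Iterable, List, Tuple


def expand_ngrams(valid: Iterable[Tuple[int, ...]], num_states: int) -> List[Tuple[int, ...]]:
    """Extend every surviving d-gram by each of the K possible next states."""
    return [gram + (k,) for gram in sorted(valid) for k in range(num_states)]


# Two surviving bigrams over K = 3 states expand into 2 * 3 = 6 candidate trigrams.
survivors = {(0, 1), (2, 2)}
print(expand_ngrams(survivors, num_states=3))
```

The size of the expanded list is the number of survivors times K, so aggressive filtering at one level directly limits the work done at the next.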



6.1.1 Handwriting Recognition 

We first evaluated the accuracy of the cascade using the handwriting recognition dataset from
Taskar et al. (2003). This dataset consists of 6877 handwritten words, with average length of ~8
characters, from 150 human subjects, from the data set collected by Kassel (1995). Each word was
segmented into characters, and each character was rasterized into an image of 16 by 8 binary pixels.
The dataset is divided into 10 folds; we used 9 folds for training and a single withheld fold for testing
(note that Taskar et al. (2003) used 9 folds for testing and 1 for training due to computational
limitations, so our results are not directly comparable). Results are averaged across all 10 folds.

Figure 6: Sparsity of inference during an example sequence cascade. Each panel shows the complexity
of inference on a different example from the OCR dataset at each position in the sequence. The
total height of each bar represents the size of the set of valid assignments $\mathcal{S}$, while the shaded portion
represents the remaining assignments after thresholding. Although complexity rises as unigrams
are expanded into bigrams, filtering with bigrams and trigrams quickly reduces complexity to a few
possible assignments at each position.

Our objective was to measure the improvement in predictive accuracy as higher order models
were incorporated into the cascade. We trained six cascades, up to a sixth-order (sexagram)
linear-chain model. This is significantly higher order than the third-order (trigram) models
typically used in sequence classification tasks. Note that in practice, the additional accuracy gained
by increasing the order of the model might be offset by the additional filtering errors incurred due to
lengthening the cascade. Thus, each level of each cascade was tuned to achieve maximum efficiency
subject to a maximum error tolerance $\epsilon$, whereby $\alpha$ was set such that no more than $\epsilon$ filtering error
was incurred by each level of the cascade.

Results are summarized in Table 2. We found that using higher order models led to a dramatic
gain in accuracy on this dataset, increasing character accuracy from 77.35% to 98.54% and increasing
word accuracy from 26.74% to 96.16%. It is interesting to note that the word-level accuracy of
the sixth-order model is roughly equivalent to the character-level accuracy of the trigram model.
Furthermore, using a development set, we found that a stricter tolerance was required to gain accuracy
from fifth- and sixth-order models, as reflected in Table 2. Finally, compared to previous
Model Order              1        2        3        4        5        6
Accuracy, Char. (%)    77.35    85.02    96.20    97.21    98.27    98.54
Accuracy, Word (%)     26.74    45.67    88.25    91.35    93.74    96.16
Filter Loss (%)                  0.50     0.73     1.00     0.75     0.57
Tolerance (%)           1.00     1.00     1.00     1.00     0.50     0.25
Avg. Num n-grams       26.0    127.97   101.84    18.80    82.12    73.36

Table 2: Summary of handwriting recognition results. For each level of the cascade, we computed prediction
accuracy (at character and word levels) using a standard voting perceptron algorithm as well as the filtering
loss and average number of unfiltered n-grams per position for the SPC on the test set.



Figure 7: Basic PS model with state $y_i$ for a part $Y_i$. Left: graphical model representation. Second:
state space representation as a tensor (rows, columns, angles). Rightmost two panels: Illustration of
state space laid out as a stick figure representation in image coordinates.



approaches on this dataset, our accuracies are much higher; the best previously reported result on
this dataset was 90.19% (Daumé et al., 2009).

In fact, the extremely high accuracies of our approach on this dataset highlight the particular
features of this data. Due to the high number of subjects used, there are only 55 unique words in
the handwriting recognition dataset. Indeed, if just the first three letters of each word are given
exactly, one can guess the identity of the word with 94.5% accuracy. Given more letters, it is
possible to uniquely identify the word with 100% accuracy. However, due to inter-subject variance,
previous approaches have not been able to approach this theoretical performance. By being able to
utilize very high order cliques, SPC overcomes this limitation.

To gain intuition about the inference process of SPC, a detailed picture of the complexity of
inference for a few representative examples is presented in Figure 6 for the fourth-order cascade
model. This figure also demonstrates the flexibility of the cascade: although a single threshold is
chosen, the max-marginals around unambiguous portions of the input are eliminated first.

6.2 Pictorial Structure Cascade 

Classical pictorial structures (PS) are a class of graphical models where the nodes of the graph
represent object parts, and edges between parts encode pairwise geometric relationships. For
modeling human pose, the standard PS model is a tree structure with unary potentials (also referred
to as appearance terms) for each part and pairwise terms between pairs of physically connected
parts. Figure 7 shows a PS model for 6 upper body parts, with lower arms connected to upper
arms, and upper arms and head connected to torso. Note that in previous work (Ramanan and
Sminchisescu, 2006; Felzenszwalb and Huttenlocher, 2005; Ferrari et al., 2008; Andriluka et al.,
2009), unlike the approach described in this section, the pairwise terms do not depend on data
and are hence referred to as a "spatial" or "structural" prior.

The state of part $i$, denoted as $y_i \in \mathcal{Y}_i$, encodes the joint location of the part in image coordinates
Figure 8: Overview: A coarse-to-fine cascade of pictorial structures filters the pose space so
that expressive and computationally expensive features can be used in the final pictorial structure. 
Shown are 5 levels of the coarse-to-fine cascade for the right upper and lower arm parts. Green 
vectors represent position and angle of unpruned states, the downsampled images correspond to 
the dimensions of the respective state space, and the white rectangles represent classification using 
our final model. 



and the direction of the limb as a unit vector: $y_i = [y_{ix}\; y_{iy}\; y_{iu}\; y_{iv}]^\top$. The state of the model is the
collection of states of $\ell$ parts: $y = [y_1, \ldots, y_\ell]$. The size of the state space for each part is the
number of possible locations in the image times the number of pre-defined discretized angles. For
example, we model the state space of each part in an $80 \times 80$ grid for $y_{ix} \times y_{iy}$, with 24 different
possible values of angles, yielding $|\mathcal{Y}_i| = 80 \times 80 \times 24 = 153{,}600$ possible placements.
Given a part configuration, we define cliques over pairwise and unary terms:



$$\theta(x, y) = \sum_{ij} \theta_{ij}^\top f_{ij}(x, y_i, y_j) + \sum_i \theta_i^\top f_i(x, y_i). \tag{26}$$

Thus, the parameters of the model are the pairwise and unary weight vectors $\theta_{ij}$ and $\theta_i$ corresponding
to the pairwise and unary feature vectors $f_{ij}(x, y_i, y_j)$ and $f_i(x, y_i)$.



One of the reasons pictorial structures models have been so popular in the literature is that
Felzenszwalb and Huttenlocher (2005) proposed a way to perform max inference on (26) in linear time
using distance transforms, which is only possible if the pairwise term is a quadratic function of the
displacement between neighbors $y_i$ and $y_j$. We wish to go beyond such a simple geometric prior
for the pairwise term of (26), and thus rely on standard $O(|\mathcal{Y}_i|^2)$ dynamic programming techniques
to compute the MAP assignment or part posteriors, as was the case for linear-chain models in the
previous section. However, unlike linear-chain models, many highly effective pairwise features one
might design would be intractable to compute in this manner for a reasonably sized state space; for
example, an $80 \times 80$ image with a part angle discretization of 24 bins yields $|\mathcal{Y}_i|^2 = 153{,}600^2 \approx 23.6$ billion
part-part hypotheses, far too many to store in a dynamic programming table (e.g., Figure [s]).



6.2.1 Coarse-to-Fine Resolution Cascade 

To overcome the issue of feature intractability, we define a coarse-to-fine structured prediction 
cascade over the resolution of the state space (Figure [2]3). Note that unlike the linear-chain 
cascade, the cliques do not change from one level to the next. Instead, the state space of each 
part in one model is subdivided to form the state space of the next model. Once again, we learn 
Figure 9: Left: Detector-based pruning (0th order model) by thresholding yields many hypotheses
far away from the true one for the lower right arm. The CPS (bottom row), however, exploits global
information to perform better pruning. Right: PCP curves of our cascade method (blue) show
increased accuracy versus a detection pruning approach (green), evaluated using PCP on arm parts.



parameters $\theta$ and $\alpha$ for the cascade using Algorithm 1. The coarse-to-fine cascade is outlined in
Figure 8.

Max-marginals for the pose model can be visualized to provide some intuition.
In general, the max-marginal for location/angle $\theta^\star(x, y_i)$ is the score of the best global body pose
which constrains $Y_i = y_i$. In a pictorial structure model, this corresponds to fixing limb $i$ at
(location, angle) $y_i$, and determining the highest scoring configuration of other part locations and
angles under this constraint. Thus, a part could have weak individual image evidence of being at
location $y_i$ but still have a high max-marginal score if the rest of the model believes this is a likely
location.

While the fine-level, target state space has size $80 \times 80 \times 24$, the first level cascade coarsens
the state space down to $10 \times 10 \times 12 = 1200$ states per part, which allows for efficient exhaustive
inference. In our experiments, we always set $\alpha = 0$, effectively throwing away half of the states at
each stage. After pruning we double one of the dimensions (first angle, then the minimum of width
or height) and continue (see Table 3).
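One prune-and-refine step of such a coarse-to-fine scheme can be sketched as follows. This Python is an illustration under stated assumptions: states are (row, column, angle) index triples on the current grid, their max-marginals are given, setting alpha to zero is taken to mean thresholding at the mean max-marginal, and each surviving state is split into two children along the doubled dimension. The name `prune_and_refine` is hypothetical.

```python
from typing import List, Sequence, Tuple

State = Tuple[int, int, int]  # (row, col, angle) indices on the current grid


def prune_and_refine(states: Sequence[State], scores: Sequence[float], axis: int) -> List[State]:
    """Keep states scoring above the mean max-marginal, then double resolution along `axis`."""
    tau = sum(scores) / len(scores)  # alpha = 0: threshold at the mean max-marginal
    survivors = [s for s, m in zip(states, scores) if m > tau]
    refined = []
    for s in survivors:
        for child in (0, 1):  # each coarse cell splits into two finer cells
            cell = list(s)
            cell[axis] = 2 * s[axis] + child
            refined.append(tuple(cell))
    return refined


# Four coarse states with made-up max-marginals; the angle dimension (axis 2) is doubled.
coarse = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)]
print(prune_and_refine(coarse, [1.0, 4.0, 2.0, 5.0], axis=2))
```

Pruning before refining keeps the number of active states roughly constant from level to level even though the resolution of the grid doubles.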

The coarse-to-fine stages use standard PS features. HoG part detectors are run once over the 
original state space, and the outputs are resized for features in the coarser state spaces. For pairwise 
features, we use the standard relative geometric cues of angle and displacement. The features are 
discretized uniformly, and thus multi-modal pairwise costs can be learned. 

Once the cascade has reduced the fine-level state space to a manageable size, we apply a boosted
model with many expensive, powerful features. As can be seen in Table 3, the coarse-to-fine cascade
leaves us with roughly 500 valid assignments per part; for each possible part location and valid
part pairs, we compute features using image contours, moments of the shape and regions underlying
each part, color and texture appearance models, color similarity between parts, and geometry.

One practical detail differentiates this cascade from others discussed in this section. Rather than
learn a standard structured perceptron for prediction in the final stage, we concatenate all unary
and pairwise features for part-pairs into a feature vector and learn boosting ensembles which give
us our pairwise clique scores.* This method of learning clique scores has several advantages over
stochastic subgradient learning: it is faster to train, can determine better thresholds on features

* We use OpenCV's implementation of Gentleboost and boost on trees of depth 3, setting the optimal number of
rounds via a hold-out set.
                                  # states in the
level               state         original    pruned    state space      PCP0.2
                    dimensions    space       space     reduction (%)    arms oracle
0                   10x10x12      153600      1200      00.00
1                   10x10x24      72968       1140      52.50            54
3                   20x20x24      6704        642       95.64            51
5                   40x40x24      2682        671       98.25            50
7                   80x80x24      492         492       99.67            50
detection pruning   80x80x24      492         492       99.67            44

Table 3: For each level of the cascade we present the reduction of the size of the state space
after pruning each stage and the quality of the retained hypotheses measured using PCP0.2. As a
baseline, we compare to pruning the same number of states in the HoG detection map (see text).



than uniform binning, and can combine different features in a tree to learn complex, non-linear
interactions. In general, we can use any method of learning with sparse inference for the final stage
of Algorithm 1.

6.2.2 Buffy and PASCAL Dataset Results 

We evaluated the pose cascade on the publicly available Buffy The Vampire Slayer v2.1 and PASCAL
Stickmen datasets (Eichner and Ferrari, 2009). We used the upper body detection windows
provided with the dataset as input to localize and scale normalize the images before running our
experiments as in Eichner and Ferrari (2009); Ferrari et al. (2008); Andriluka et al. (2009). The
standard 235 Buffy test images were used for testing, as well as the 360 detected people from
PASCAL stickmen. We used the remaining 513 images from Buffy for training and validation.

The typical measure of performance on this dataset is a matching criterion based on both endpoints
of each part (e.g., matching the elbow and the wrist correctly): A limb guess is correct if
the limb endpoints are on average within $r$ of the corresponding groundtruth segments, where $r$ is
a fraction of the groundtruth part length. By varying $r$, a performance curve is produced where
the performance is measured in the percentage of correct parts (PCP) matched with respect to $r$.
We define PCP$_r$ as the value of the curve at $r$.
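The PCP criterion can be sketched in Python as below. This is a simplified reading, stated as an assumption: a limb counts as correct when the mean distance between predicted and ground-truth endpoints is at most r times the ground-truth limb length, so endpoint-to-endpoint distance stands in for distance to the ground-truth segment. The names `limb_correct` and `pcp` are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]
Limb = Tuple[Point, Point]  # (first endpoint, second endpoint)


def limb_correct(pred: Limb, truth: Limb, r: float) -> bool:
    """True if the mean endpoint error is within r times the ground-truth limb length."""
    length = math.dist(truth[0], truth[1])
    mean_err = (math.dist(pred[0], truth[0]) + math.dist(pred[1], truth[1])) / 2.0
    return mean_err <= r * length


def pcp(preds: Sequence[Limb], truths: Sequence[Limb], r: float) -> float:
    """Percentage of correct parts at tolerance r."""
    correct = sum(limb_correct(p, t, r) for p, t in zip(preds, truths))
    return 100.0 * correct / len(truths)


# Two predictions for the same made-up ground-truth limb of length 10.
truth = ((0.0, 0.0), (10.0, 0.0))
preds = [((1.0, 0.0), (10.0, 1.0)), ((5.0, 5.0), (15.0, 5.0))]
print(pcp(preds, [truth, truth], r=0.5))
```

Sweeping r from small to large values and plotting the result gives the PCP curve described above; a tight value such as 0.2 isolates well-localized limbs.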

As shown in Table 4, the cascade performs comparably with the state-of-the-art on all parts,
significantly outperforming earlier work. We also compared to a much simpler approach, inspired
by Felzenszwalb et al. (2010) (detector pruning + rich features): We prune by thresholding each
unary detection map individually to obtain the same number of states as in our final cascade level,
and then apply our final model with rich features on these states. As can be seen in Figure 9,
this baseline performs significantly worse than our method (performing about as well as a standard
PS model as reported in Sapp et al. (2010a)). This makes a strong case for using max-marginals
(i.e., a global image-dependent quantity) for pruning, as well as learning how to prune safely and
efficiently, rather than using static thresholds on individual part scores as in Felzenszwalb et al.
(2010).

In Table 3, we evaluate the test time efficiency and accuracy of our system after each successive
stage of pruning. In the early stages, the state space is too coarse for the MAP state sequence
of one of the pruning models to be meaningfully compared to the fine-resolution groundtruth, so
we report PCP scores of the best possible as-yet unpruned state left in the original space. We
choose a tight PCP0.2 threshold to get an accurate understanding of whether or not we have lost
well-localized limbs. As seen in Table 3, the drop in PCP0.2 is small and linear, whereas the pruning
of the state space is exponential: half of the states are pruned in the first stage. As a baseline,
Method                        Torso         Head          Upper Arms    Lower Arms    Total

Buffy
Andriluka et al. (2009)       90.7          95.5          79.3          41.2          73.5
Eichner and Ferrari (2009)    [illegible]   [illegible]   [illegible]   [illegible]   [illegible]
Sapp et al. (2010a)           100           100           91.1          65.7          85.9
CPS (ours)                    [illegible]   [illegible]   [illegible]   [illegible]   [illegible]
Detector pruning              99.6          87.3          90.0          55.3          79.6

PASCAL stickmen
Eichner and Ferrari (2009)    97.22         88.60         73.75         41.53         69.31
Sapp et al. (2010a)           100           98.0          83.9          54.0          79.0
CPS (ours)                    100           99.2          81.5          53.9          78.3

Table 4: Comparison to other methods at PCP0.5. See text for details. We perform comparably
to state-of-the-art on all parts, improving on upper arms. (NOTE: the numbers included here
are slightly different from the published version; what is seen here exactly matches the publicly
available reference implementation at http://vision.grasp.upenn.edu/video/).



we evaluate the simple detector-based pruning described above. This leads to a significant loss
of correct hypotheses, to which we attribute the poor end-system performance of this baseline (in
Figure 9 and Table 4), even after adding richer features.

6.3 Loopy Graphs with Ensemble-SPC 

We evaluated Ensemble-SPC in two experiments. First, we analyzed the "best-case" filtering
performance of the summed max-marginal approximation to the true marginals on a synthetic image
segmentation task, assuming the true scoring function $\theta(x, y)$ is available for inference. Second, we
evaluated the accuracy of our approach on a difficult, real-world human pose dataset
(VideoPose). In both experiments, the max-marginal ensemble outperforms state-of-the-art
baselines.



6.3.1 Asymptotic Filtering Accuracy on Synthetic Data 



We first evaluated the filtering accuracy of the max-marginal ensemble on a synthetic 8-class
segmentation task. For this experiment, we removed variability due to parameter estimation and
focused our analysis on accuracy of inference. We compared our approach to Loopy Belief Propagation
(Loopy BP) (Pearl, 1988; McEliece et al., 1998; Murphy et al., 1999), on an $11 \times 11$ two-dimensional
grid MRF.* For the ensemble, we used 22 unique "comb" tree structures to approximate the full
grid model. To generate a synthetic instance, we generated unary potentials $\omega_i(k)$ uniformly on
$[0, 1]$ and pairwise potentials log-uniformly: $\omega_{ij}(k, k') = e^{-v}$, where $v \sim \mathcal{U}[-25, 25]$ was sampled
independently for every edge and every pair of classes. (Note that for the ensemble, we normalized
unary and edge potentials by dividing by the number of times that each potential was included in
any model.) It is well known that inference for such grid MRFs is generally difficult (Koller and
Friedman, 2009), and we observed that Loopy BP failed to converge for at least a few variables on
most examples we generated.
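The synthetic instance generation can be sketched with the Python standard library. This is an illustration, not the authors' generator: the grid is represented with dictionaries keyed by node and by edge, the pairwise potential is taken to be the exponential of minus a uniform draw on [-25, 25], and the name `make_grid_mrf` and its `seed` parameter are hypothetical.

```python
import math
import random
from typing import Dict, List, Tuple

Node = Tuple[int, int]
Edge = Tuple[Node, Node]


def make_grid_mrf(size: int = 11, num_classes: int = 8,
                  seed: int = 0) -> Tuple[Dict[Node, List[float]], Dict[Edge, List[List[float]]]]:
    """Sample unary potentials uniformly on [0, 1] and pairwise potentials log-uniformly."""
    rng = random.Random(seed)
    unary = {(r, c): [rng.uniform(0.0, 1.0) for _ in range(num_classes)]
             for r in range(size) for c in range(size)}
    pairwise = {}
    for r in range(size):
        for c in range(size):
            # Connect each node to its right and lower neighbours (4-connected grid).
            for nr, nc in ((r, c + 1), (r + 1, c)):
                if nr < size and nc < size:
                    pairwise[((r, c), (nr, nc))] = [
                        [math.exp(-rng.uniform(-25.0, 25.0)) for _ in range(num_classes)]
                        for _ in range(num_classes)
                    ]
    return unary, pairwise


unary, pairwise = make_grid_mrf()
print(len(unary), len(pairwise))
```

An 11 by 11 grid has 121 nodes and 220 edges, and each edge carries an 8 by 8 table, so pairwise potentials span many orders of magnitude; this is part of what makes inference on such instances hard.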

We evaluated our approach on 100 synthetic grid MRF instances. For each instance, we com- 
puted the accuracy of filtering using marginals from Loopy BP, the ensemble, and each individual 



* We used the UGM Matlab Toolbox by Mark Schmidt for the Loopy BP and Gibbs MCMC comparisons, see:
http://people.cs.ubc.ca/~schmidtm/Software/UGM.html
Figure 10: (a) Schematic overview of Ensemble-SPC for human pose tracking. The m'th level of the cascade
takes as input a sparse set of states for each variable $y_j$. The full model is decomposed into constituent
sub-models (above, the three tree models used in the pose tracking experiment) and sparse inference is run.
Next, the max-marginals of the sub-models are summed to produce a single max-marginal for each variable
assignment: $\theta^\star(x, y_j) = \sum_p \theta_p^\star(x, y_j)$. Note that each level and each constituent model will have different
parameters as a result of the learning process. Finally, the state spaces are thresholded based on the
max-marginal scores and low-scoring states are filtered. Each state is then refined according to a state hierarchy
(e.g., spatial resolution, or semantic categories) and passed to the next level of the cascade. This process can
be repeated as many times as desired. In (b), we illustrate two consecutive levels of the ensemble cascade
on real data, showing the filtered hypotheses left for a single video example.



sub-model. We determined error rates by counting the number of times the "ground truth" was
incorrectly filtered when the top K states were kept for each variable, where we sampled 1000 "ground
truth" examples from the true joint distribution using Gibbs sampling. To obtain a good estimate
of the true marginals, we restarted the chain for each sample and allowed 1000 iterations of mixing
time. The results are presented in Figure 11 for all possible values of K (filter aggressiveness). We
found that the ensemble outperformed Loopy BP and the individual sub-models by a significant
margin for all K.
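The error measure described above can be sketched as follows. This is a minimal illustration, assuming per-variable score lists (for example, max marginals) and a single ground-truth labeling; the function name `filtering_error` and the toy scores are hypothetical and not taken from the experiments.

```python
def filtering_error(scores, truth, k):
    """Fraction of variables whose true state is not among the k top-scoring states.

    scores: list (one entry per variable) of lists of per-state scores
    truth:  list of true state indices, one per variable
    """
    mistakes = 0
    for state_scores, true_state in zip(scores, truth):
        # Rank states from highest to lowest score and keep the first k.
        ranked = sorted(range(len(state_scores)),
                        key=lambda s: state_scores[s], reverse=True)
        kept = set(ranked[:k])
        if true_state not in kept:
            mistakes += 1
    return mistakes / len(truth)


# Two variables with three states each (hypothetical scores).
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
truth = [2, 0]
print(filtering_error(scores, truth, k=1))  # 0.5
print(filtering_error(scores, truth, k=2))  # 0.0
```

Sweeping `k` from 1 to the number of states gives one error value per level of filter aggressiveness, which is the kind of curve summarized in Figure 11.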

We next investigated the question of whether or not the ensembles were most accurate on 
variables for which the sub-models tended to agree. For each variable yij in each instance, we 
computed the mean pairwise Spearman correlation between the ranking of the 8 classes induced by 
the max marginals of each of the 22 sub-models. We found that complete agreement between all sub- 
models never occurred (the median correlation was 0.38). We found that sub-model agreement was 
significantly correlated (p < 10~^^) with the error of the ensemble for all values of K, peaking at ρ =
−0.143 at K = 5. Thus, increased agreement predicted a decrease in error of the ensemble. We then
asked: does the effect of model agreement explain the improvement of the ensemble
over Loopy BP? In fact, the improvement in error compared to Loopy BP was not correlated with
sub-model agreement for any K (maximum ρ = 0.0185, p < 0.05). Thus, sub-model agreement

[Figure 11 plot omitted. Y-axis: Average Improvement in Filtering Error. X-axis: Labels Kept (Classify, 2/8, 3/8, 4/8, 5/8, 6/8, 7/8).]

Figure 11: Improvement over Loopy BP and constituent tree-models on the synthetic segmentation task.
Error bars show standard error.

does not explain the improvement over Loopy BP, indicating that sub-model disagreement is not
related to the inference difficulties (e.g., convergence failure) that cause Loopy BP to underperform
relative to the ensembles.
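The agreement statistic used in this analysis, the mean pairwise Spearman correlation between the class rankings induced by each sub-model, can be sketched in plain Python. The helper names and the toy score vectors are hypothetical; the sketch assumes no score vector is constant (a constant vector has zero rank variance and would cause a division by zero).

```python
from itertools import combinations


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = mean_rank
        i = j + 1
    return result


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


def mean_pairwise_agreement(max_marginals):
    """Mean Spearman correlation over all pairs of sub-model score vectors."""
    pairs = list(combinations(max_marginals, 2))
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)


# Three hypothetical sub-models scoring four classes for one variable.
sub_model_scores = [
    [0.9, 0.5, 0.3, 0.1],
    [0.8, 0.6, 0.2, 0.1],
    [0.1, 0.3, 0.5, 0.9],
]
print(mean_pairwise_agreement(sub_model_scores))
```

In this toy case two sub-models rank the classes identically and the third reverses the order, so the mean of the three pairwise correlations is (1 − 1 − 1)/3.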



6.3.2 Articulated Pose Tracking Cascade 

The VideoPose dataset consists of 34 video clips of approximately 50 frames each. The clips were
harvested from three popular TV shows: 3 from Buffy the Vampire Slayer, 27 from Friends, and
4 from LOST. Clips were chosen to highlight a variety of situations and movements when the
camera is largely focused on a single actor. In our experiments, we use the Buffy clips and half of the
Friends clips for training (17 clips), and the remaining Friends and LOST clips for testing. In total
we test on 901 individual frames. The Friends clips are split so that no clips from the same episode are
used for both training and testing. We further set aside 4 of the Friends test clips to use as a
development set. Each frame of each clip is hand-annotated with locations of joints of a full pose
model; for simplicity, we use only the torso and upper arm annotations in this work, as these have
the strongest continuity across frames and strong geometric relationships.

All of the models we evaluated on this dataset share the same basic structure: a variable for 
each limb's (x, y) location and angle rotation (torso, left arm, and right arm) with edges between 
torso and arms to model pose geometry. We refer to this basic model, evaluated independently on 
each frame, as the "Single Frame" approach. For the VideoPose dataset, we augmented this model 



by adding edges between limb states in adjacent frames (Figure 10), forming an intractable, loopy 



The VideoPose dataset is available online at http://vision.grasp.upenn.edu/video/.

Figure 12: Qualitative test results. Points shown are the positions of left/right shoulders and torsos at the
last level of the ensemble SC (blue square, green dot, and white circle, respectively). Also shown (green line
segments) are the best-fitting hypotheses to ground-truth joints, selected from within the top 4 max-marginal
values. Shown as dotted gray lines is the best-guess pose returned by the method of Ferrari et al. (2008).



model. Our features in a single frame are the same as in the beginning levels of the pictorial structure
cascade (Section 6.2): unary features are discretized Histogram of Gradient (HoG) part detector
scores, and pairwise terms measure relative displacement in location and angle between neighboring
parts. Pairwise features connecting limbs across time also express geometric displacement, allowing
our model to capture the fact that human limbs move smoothly over time.

We learned a coarse-to-fine structured cascade with six levels for tracking as follows. The six
levels use increasingly finer state spaces for joint locations, discretized into bins of resolution 10 × 10
up to 80 × 80, with each stage doubling one of the state space dimensions in the refinement step.
All levels use an angular discretization of 24 bins. For the ensemble cascade, we learned three
sub-models simultaneously (Figure 10), with each sub-model accounting for temporal consistency
for a different limb by adding edges connecting the same limb in consecutive frames.
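The refinement schedule described above can be enumerated with a short sketch. The order in which the two spatial dimensions are doubled is not stated in the text, so the rule used here (double the smaller dimension first, width on ties) is an assumption, as is the function name `refinement_schedule`. Index 0 is the initial 10 × 10 state space, followed by six doubling steps.

```python
def refinement_schedule(start=(10, 10), end=(80, 80), angle_bins=24):
    """List the (width, height, angle_bins) state space at each cascade index."""
    width, height = start
    schedule = [(width, height, angle_bins)]
    while (width, height) != end:
        # Assumption: double the smaller dimension first, width on ties.
        if width <= height:
            width *= 2
        else:
            height *= 2
        if width > end[0] or height > end[1]:
            raise ValueError("end resolution is not reachable by doubling")
        schedule.append((width, height, angle_bins))
    return schedule


for index, dims in enumerate(refinement_schedule()):
    print(index, dims)
```

Under this assumption, indices 2, 4, and 6 have state dimensions 20 × 20 × 24, 40 × 40 × 24, and 80 × 80 × 24, which agrees with the levels reported in Figure 13(c).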

A summary of results is presented in Figure 13. We compared the single-frame cascade and
the ensemble cascade to a state-of-the-art single-frame pose detector (Ferrari et al., 2008) and to
one of the individual sub-models, modeling torso consistency only ("Torso Only").
We evaluated the method from Ferrari et al. (2008) on only the first half of the test data due to
computation time (approximately 7 minutes/frame). We found that the ensemble cascade
was the most accurate for every joint in the model, that all cascades outperformed the state-of-the-art
baseline, and, interestingly, that the single-frame cascade outperformed the torso-only cascade.
We suspect that the poor performance of the torso-only model may arise because propagating only
torso states through time leads to an over-reliance on the relatively weak torso signal to determine
the location of all the limbs. Sample qualitative output from the ensemble is presented in Figure 12.



7 Conclusion 

We presented Structured Prediction Cascades, a framework for adaptively increasing the complexity 
of structured models on a per-example basis while maintaining efficiency of inference. This allows for 
the construction and training of structured models of far greater complexity than was previously 
possible. We proposed two novel loss functions, filtering loss and efficiency loss, that measure 
the two objectives balanced by the cascade, and provided generalization bounds for these loss 
functions. We proposed a simple sub-gradient based learning algorithm to minimize these losses,
and presented a stage-wise learning algorithm for the entire cascade in Algorithm 1. We also
showed how to extend the previous algorithm and theoretical results to the setting in which exact
inference is intractable, using Ensemble-SPC. Finally, we showed experimentally state-of-the-art
performance across multiple domains.
[Figure 13 plots omitted. Panels: (a) Decoding Error; (b) Top K = 4 Error. X-axis categories in both panels: Torso, Shoulders, Elbows.]

(c) Ensemble efficiency.

Level   State Dimensions   PCP0.25 in top K=4 (%)   Efficiency (%)
0       10 × 10 × 24       -                        -
2       20 × 20 × 24       98.8                     87.5
4       40 × 40 × 24       93.8                     96.9
6       80 × 80 × 24       84.6                     99.2

Figure 13: (a),(b): Prediction error for the VideoPose dataset. Reported errors are the average distance from
a predicted joint location to the true joint for frames that lie in the [25,75] inter-quartile range (IQR) of
errors. Error bars show standard errors computed with respect to clips. All SC models outperform (Ferrari
et al., 2008); the "torso only" persistence cascade introduces additional error compared to a single-frame
cascade, but adding arm dependencies in the ensemble yields the best performance. (c): Summary of test set
filtering efficiency and accuracy for the ensemble cascade. PCP0.25 measures Oracle % of correctly matched
limb locations given unfiltered states; see (Sapp et al., 2010b) for more details.
A Proofs of Theorems 1 and 2

We first summarize the Rademacher and Gaussian complexity definitions and results from Bartlett
and Mendelson (2002) required to prove the theorems.



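Definition 6 (Rademacher and Gaussian complexities). Let $H : \mathcal{X} \mapsto \mathbb{R}$ be a function class and
$x^1, \ldots, x^n$ be n independent samples from a fixed distribution. Define the random variables:

$$\hat{R}_n(H) = \mathbb{E}_\sigma\left[\, \sup_{h \in H} \left| \frac{2}{n} \sum_{i=1}^{n} \sigma_i h(x^i) \right| \;\Big|\; x^1, \ldots, x^n \right], \qquad (27)$$

$$\hat{G}_n(H) = \mathbb{E}_g\left[\, \sup_{h \in H} \left| \frac{2}{n} \sum_{i=1}^{n} g_i h(x^i) \right| \;\Big|\; x^1, \ldots, x^n \right], \qquad (28)$$

where $\sigma_i \in \pm 1$ are independent uniform and $g_i \in \mathbb{R}$ are independent standard Gaussian. Then
$R_n(H) = \mathbb{E}[\hat{R}_n(H)]$ and $G_n(H) = \mathbb{E}[\hat{G}_n(H)]$ are the Rademacher and Gaussian complexities of H.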

Consider a general loss function $\Phi(y, h(x))$ where $h(x) \in \mathbb{R}^m$ represents the prediction function.
In our case, $h(x)$ is a vector of clique assignment scores $\theta^\top f_c(x, y_c)$ of dimension $\sum_{c \in C} |\mathcal{Y}_c|$, indexed by
$y_c$ (a clique and its assignment). This vector $h(x)$ contains all the information needed to compute
the max-marginals and threshold for a given example x. Both $\mathcal{L}_e$ and $\mathcal{L}_f$ can be written in this
general form, as we detail below.

Definition 7 (Lipschitz continuity with respect to Euclidean norm). Let $\phi : \mathbb{R}^m \mapsto \mathbb{R}$; then $\phi$ is
Lipschitz continuous with constant $L(\phi)$ with respect to Euclidean norm if for any $z_1, z_2 \in \mathbb{R}^m$:

$$|\phi(z_1) - \phi(z_2)| \le L(\phi)\, \|z_1 - z_2\|_2. \qquad (29)$$

We recall the relevant results in the following theorem: 

Theorem 3 (Bartlett and Mendelson, 2002). Consider a loss function $\Phi : \mathcal{Y} \times \mathbb{R}^m \mapsto \mathbb{R}$ and a
dominating cost function $\phi : \mathcal{Y} \times \mathbb{R}^m \mapsto \mathbb{R}$ such that $\Phi(y, z) \le \phi(y, z)$. Let $H : \mathcal{X} \mapsto \mathbb{R}^m$ be a
vector-valued class of functions. Then for any integer n and any $0 < \delta < 1$, with probability $1 - \delta$
over samples of length n, every h in H satisfies

$$\mathbb{E}[\Phi(y, h(x))] \le \hat{\mathbb{E}}[\phi(y, h(x))] + R_n(\phi \circ H) + \sqrt{\frac{8 \ln(2/\delta)}{n}}, \qquad (30)$$

where $\phi \circ H$ is a class of functions defined by centered composition of $\phi$ with $h \in H$,
$\phi \circ h = \phi(y, h(x)) - \phi(y, 0)$.

Furthermore, Rademacher complexity can be bounded using Gaussian complexity: there are
absolute constants c and C such that for every class H and every integer n,

$$c\, R_n(H) \le G_n(H) \le (C \ln n)\, R_n(H). \qquad (31)$$

Let $H : \mathcal{X} \mapsto \mathbb{R}^m$ be a class of functions that is the direct sum of real-valued classes $H_1, \ldots, H_m$.
Then, for every integer n and every sample $(x^1, \ldots, x^n)$,

$$\hat{G}_n(\phi \circ H) \le 2 L(\phi) \sum_{i=1}^{m} \hat{G}_n(H_i), \qquad (32)$$

where $L(\phi)$ is the Lipschitz constant of $\phi$ with respect to Euclidean distance. Finally, for the
2-norm-bounded linear class of functions, $H = \{x \mapsto \theta^\top f(x) \mid \|\theta\|_2 \le B,\ \|f(x)\|_2 \le 1\}$,

$$\hat{G}_n(H) \le \frac{2B}{\sqrt{n}}. \qquad (33)$$
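The bound for the norm-bounded linear class can be checked numerically. The sketch below estimates the empirical Gaussian complexity by Monte Carlo, using the fact that for this class the supremum over $\theta$ has the closed form $(2B/n)\,\|\sum_i g_i f(x^i)\|_2$ (by the Cauchy-Schwarz inequality). The sample of unit-norm feature vectors, the function name, and all parameter values are hypothetical choices made only for illustration.

```python
import math
import random


def gaussian_complexity_linear(features, bound, draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Gaussian complexity of
    {x -> theta . f(x) : ||theta||_2 <= bound} on a fixed sample."""
    rng = random.Random(seed)
    n, dim = len(features), len(features[0])
    total = 0.0
    for _ in range(draws):
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Closed-form supremum over theta:
        # sup |(2/n) sum_i g_i theta.f(x_i)| = (2*bound/n) * ||sum_i g_i f(x_i)||_2
        s = [sum(g[i] * features[i][d] for i in range(n)) for d in range(dim)]
        total += (2.0 * bound / n) * math.sqrt(sum(c * c for c in s))
    return total / draws


# Hypothetical sample: n unit-norm feature vectors in 3 dimensions.
sample_rng = random.Random(1)
n, B = 50, 2.0
features = []
for _ in range(n):
    vec = [sample_rng.gauss(0.0, 1.0) for _ in range(3)]
    norm = math.sqrt(sum(c * c for c in vec))
    features.append([c / norm for c in vec])

estimate = gaussian_complexity_linear(features, B)
upper = 2.0 * B / math.sqrt(n)
print(f"estimate = {estimate:.4f}, bound 2B/sqrt(n) = {upper:.4f}")
```

The estimate should fall below $2B/\sqrt{n}$, since Jensen's inequality gives $\mathbb{E}\|\sum_i g_i f(x^i)\|_2 \le \sqrt{\sum_i \|f(x^i)\|_2^2} \le \sqrt{n}$.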
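A.1 Proof of Theorem 1

We will express our loss functions $\mathcal{L}_f$ and $\mathcal{L}_e$ and dominating loss functions $\mathcal{L}_f^\gamma$ and $\mathcal{L}_e^\gamma$
in terms of the framework above. We reproduce the definitions side-by-side in a slightly modified
form below, where $m = \sum_{c \in C} |\mathcal{Y}_c|$ and the $\gamma$-margin step-function $r_\gamma$ dominates the step-function,
$r_\gamma(z) \ge 1[z \le 0]$, by construction:

$$\mathcal{L}_f(x, y; \theta, \alpha) = 1[\theta(x, y) - \tau(x; \theta, \alpha) \le 0], \qquad (34)$$

$$\mathcal{L}_f^\gamma(x, y; \theta, \alpha) = r_\gamma(\theta(x, y) - \tau(x; \theta, \alpha)), \qquad (35)$$

$$\mathcal{L}_e(x, y; \theta, \alpha) = \frac{1}{m} \sum_{c \in C,\, y_c \in \mathcal{Y}_c} 1[\tau(x; \theta, \alpha) - \theta^\star(x, y_c) \le 0], \qquad (36)$$

$$\mathcal{L}_e^\gamma(x, y; \theta, \alpha) = \frac{1}{m} \sum_{c \in C,\, y_c \in \mathcal{Y}_c} r_\gamma(\tau(x; \theta, \alpha) - \theta^\star(x, y_c)). \qquad (37)$$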

We "vectorize" our scoring function $\theta$ and assignments y by defining vector-valued functions,
where the vectors are indexed by clique assignments, $y_c$, with total dimension m.

Definition 8 (Vectorization).

$$h_{y_c}(x) \triangleq \theta^\top f_c(x, y_c) \qquad (38)$$

$$v_{y_c}(y') \triangleq 1[y'_c = y_c] \qquad (39)$$

$$\theta(x, y) = h(x)^\top v(y) \qquad (40)$$
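The vectorization can be made concrete on a tiny model. The sketch below assumes a hypothetical model with two binary variables, two unary cliques, and one edge clique, so that m = 2 + 2 + 4 = 8; the score values in `h` are arbitrary illustrations.

```python
from itertools import product

NUM_STATES = 2
CLIQUES = [(0,), (1,), (0, 1)]  # hypothetical: two unary cliques and one edge

# Enumerate every (clique, assignment) pair; its position is the vector index.
INDEX = [(c, yc) for c in CLIQUES
         for yc in product(range(NUM_STATES), repeat=len(c))]


def v(y):
    """Indicator vector: an entry is 1 when labeling y agrees with that clique assignment."""
    return [1 if tuple(y[i] for i in c) == yc else 0 for c, yc in INDEX]


def score(h, y):
    """theta(x, y) = h(x)^T v(y)."""
    return sum(hi * vi for hi, vi in zip(h, v(y)))


# Hypothetical clique-assignment scores h(x), one per entry of INDEX.
h = [0.5, -0.2, 0.1, 0.4, 1.0, 0.0, -1.0, 0.3]
m = len(INDEX)

print(m)                 # 8
print(v((1, 0)))         # [0, 1, 1, 0, 0, 0, 1, 0]
print(score(h, (1, 0)))
```

Each labeling selects exactly one assignment per clique, so `v(y)` always has |C| non-zero entries; the Lipschitz arguments that follow rely on this fact.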

Clearly, the m-dimensional vector $h(x)$ contains all the information needed to compute the
max-marginals and threshold for a given example x (we assume $\alpha$ is fixed). Hence we can define
the losses in the form of Theorem 3:

$$\Phi_f(y, h(x)) = \mathcal{L}_f(x, y; \theta, \alpha) \qquad (41)$$

$$\phi_f(y, h(x)) = \mathcal{L}_f^\gamma(x, y; \theta, \alpha) \qquad (42)$$

$$\Phi_e(y, h(x)) = \mathcal{L}_e(x, y; \theta, \alpha) \qquad (43)$$

$$\phi_e(y, h(x)) = \mathcal{L}_e^\gamma(x, y; \theta, \alpha) \qquad (44)$$

What remains is to calculate the Lipschitz constants of $\phi_f$ and $\phi_e$.

Theorem 4. $\phi_f(y, \cdot)$ and $\phi_e(y, \cdot)$ are Lipschitz (with respect to Euclidean distance on $\mathbb{R}^m$) with
constant $\sqrt{2|C|}/\gamma$ for all $y \in \mathcal{Y}$.

To prove Theorem 4, we bound Lipschitz constants of constituent functions of $\phi_f$ and $\phi_e$.

Lemma 4. Fix any $y \in \mathcal{Y}$ and let $\phi_1 : \mathbb{R}^m \mapsto \mathbb{R}$ be defined as

$$\phi_1(z) = z^\top v(y) - \max_{y'} z^\top v(y').$$

Then $\phi_1(z_1) - \phi_1(z_2) \le \sqrt{2|C|}\, \|z_1 - z_2\|_2$ for any $z_1, z_2 \in \mathbb{R}^m$.
Proof. For brevity of notation in the proof below, we define $v = v(y)$, $v_1 = v(\arg\max_{y'} z_1^\top v(y'))$
and $v_2 = v(\arg\max_{y'} z_2^\top v(y'))$, with ties broken arbitrarily but deterministically. Then,

$$\phi_1(z_1) - \phi_1(z_2) = z_1^\top v - z_1^\top v_1 - z_2^\top v + z_2^\top v_2$$
$$= (z_2 - z_1)^\top (v_2 - v) + z_1^\top (v_2 - v_1)$$
$$\le (z_2 - z_1)^\top (v_2 - v)$$
$$\le \|z_2 - z_1\|_2\, \|v_2 - v\|_2$$
$$\le \sqrt{2|C|}\, \|z_1 - z_2\|_2.$$

The last three steps follow (1) from the fact that $v_1$ maximizes $z_1^\top v(y')$ (so that $z_1^\top (v_2 - v_1)$ is
non-positive), (2) from the Cauchy-Schwarz inequality, and (3) from the fact that there are |C| cliques,
each of which can contribute at most a single non-zero entry in $v$ or $v_2$. □
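As a sanity check on Lemma 4, the inequality can be tested by brute force on a small model. The sketch below assumes a hypothetical three-variable chain with three states per variable (five cliques, m = 27), draws random score vectors $z_1, z_2$, and records the largest value of $\phi_1(z_1) - \phi_1(z_2) - \sqrt{2|C|}\,\|z_1 - z_2\|_2$; the lemma says this quantity is never positive.

```python
import math
import random
from itertools import product

NUM_STATES = 3
CLIQUES = [(0,), (1,), (2,), (0, 1), (1, 2)]  # hypothetical 3-variable chain
INDEX = [(c, yc) for c in CLIQUES
         for yc in product(range(NUM_STATES), repeat=len(c))]
LABELINGS = list(product(range(NUM_STATES), repeat=3))


def v(y):
    """Indicator vector of the clique assignments selected by labeling y."""
    return [1.0 if tuple(y[i] for i in c) == yc else 0.0 for c, yc in INDEX]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def phi1(z, y):
    """phi_1(z) = z^T v(y) - max_{y'} z^T v(y'), by exhaustive search over y'."""
    return dot(z, v(y)) - max(dot(z, v(yp)) for yp in LABELINGS)


def max_violation(trials=200, seed=0):
    """Largest observed value of phi1(z1) - phi1(z2) - sqrt(2|C|) * ||z1 - z2||_2."""
    rng = random.Random(seed)
    bound = math.sqrt(2 * len(CLIQUES))
    worst = -math.inf
    for _ in range(trials):
        y = rng.choice(LABELINGS)
        z1 = [rng.gauss(0.0, 1.0) for _ in INDEX]
        z2 = [rng.gauss(0.0, 1.0) for _ in INDEX]
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(z1, z2)))
        worst = max(worst, phi1(z1, y) - phi1(z2, y) - bound * dist)
    return worst


print(max_violation() <= 0.0)
```

Random testing of this kind cannot prove the bound, but a positive value would indicate an error in the implementation of $v$ or $\phi_1$.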

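Lemma 5. Fix any $y \in \mathcal{Y}$ and let $\phi_2 : \mathbb{R}^m \mapsto \mathbb{R}$ be defined as

$$\phi_2(z) = z^\top v(y) - \frac{1}{m} \sum_{c \in C,\, y'_c \in \mathcal{Y}_c} \max_{y'' : y''_c = y'_c} z^\top v(y'').$$

Then $\phi_2(z_1) - \phi_2(z_2) \le \sqrt{2|C|}\, \|z_1 - z_2\|_2$ for any $z_1, z_2 \in \mathbb{R}^m$.

Proof. Let $v = v(y)$, $v_{1y'_c} = v(\arg\max_{y'' : y''_c = y'_c} z_1^\top v(y''))$ and $v_{2y'_c} = v(\arg\max_{y'' : y''_c = y'_c} z_2^\top v(y''))$. Then,

$$\phi_2(z_1) - \phi_2(z_2) = \frac{1}{m} \sum_{c \in C,\, y'_c \in \mathcal{Y}_c} \left( z_1^\top v - z_1^\top v_{1y'_c} - z_2^\top v + z_2^\top v_{2y'_c} \right)$$
$$= \frac{1}{m} \sum_{c \in C,\, y'_c \in \mathcal{Y}_c} \left( (z_2 - z_1)^\top (v_{2y'_c} - v) + z_1^\top (v_{2y'_c} - v_{1y'_c}) \right)$$
$$\le \frac{1}{m} \sum_{c \in C,\, y'_c \in \mathcal{Y}_c} (z_2 - z_1)^\top (v_{2y'_c} - v)$$
$$\le \frac{1}{m} \sum_{c \in C,\, y'_c \in \mathcal{Y}_c} \sqrt{2|C|}\, \|z_1 - z_2\|_2 = \sqrt{2|C|}\, \|z_1 - z_2\|_2.$$

The inequalities follow using a similar argument to the previous lemma, but made separately for each
$y'_c$. □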

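Lemma 6. Fix any $y \in \mathcal{Y}$ and let $\phi_3 : \mathbb{R}^m \mapsto \mathbb{R}$ be defined as

$$\phi_3(z) = \alpha \phi_1(z) + (1 - \alpha) \phi_2(z) = \overbrace{z^\top v(y)}^{\theta(x, y)} - \overbrace{\left( \alpha \max_{y'} z^\top v(y') + \frac{1 - \alpha}{m} \sum_{c \in C,\, y'_c \in \mathcal{Y}_c} \max_{y'' : y''_c = y'_c} z^\top v(y'') \right)}^{\tau(x; \theta, \alpha)}$$

where the over-braces show the relationship to the score of the correct label sequence and the
threshold, assuming $z = h(x)$. Then $\phi_3(z_1) - \phi_3(z_2) \le \sqrt{2|C|}\, \|z_1 - z_2\|_2$ for any $z_1, z_2 \in \mathbb{R}^m$ and the
Lipschitz constant of $\phi_f = r_\gamma \circ \phi_3$ is bounded by $\sqrt{2|C|}/\gamma$.

Proof. Combining the two previous lemmas we have that

$$\phi_3(z_1) - \phi_3(z_2) = \alpha(\phi_1(z_1) - \phi_1(z_2)) + (1 - \alpha)(\phi_2(z_1) - \phi_2(z_2)) \le \sqrt{2|C|}\, \|z_1 - z_2\|_2.$$

To show that $\phi_f$ is Lipschitz continuous with constant $\sqrt{2|C|}/\gamma$, we note that $\phi_f = r_\gamma \circ \phi_3$, so
$L(\phi_f) = L(r_\gamma) \cdot L(\phi_3) \le \sqrt{2|C|}/\gamma$. □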
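The threshold inside $\phi_3$, a convex combination of the best score and the mean max-marginal, can be computed by brute force on a tiny model. The sketch below assumes a hypothetical model with two binary variables and three cliques, and arbitrary scores `z` standing in for $h(x)$; real cascades would compute max-marginals with dynamic programming rather than enumeration.

```python
from itertools import product

NUM_STATES = 2
CLIQUES = [(0,), (1,), (0, 1)]  # hypothetical tiny model
INDEX = [(c, yc) for c in CLIQUES
         for yc in product(range(NUM_STATES), repeat=len(c))]
LABELINGS = list(product(range(NUM_STATES), repeat=2))


def agrees(y, c, yc):
    """True when labeling y takes assignment yc on clique c."""
    return tuple(y[i] for i in c) == yc


def score(z, y):
    """z^T v(y): sum of the entries of z selected by labeling y."""
    return sum(zi for zi, (c, yc) in zip(z, INDEX) if agrees(y, c, yc))


def max_marginals(z):
    """Best score of any labeling consistent with each clique assignment."""
    return [max(score(z, y) for y in LABELINGS if agrees(y, c, yc))
            for c, yc in INDEX]


def threshold(z, alpha):
    """alpha * (best score) + (1 - alpha) * (mean max-marginal)."""
    mm = max_marginals(z)
    best = max(score(z, y) for y in LABELINGS)
    return alpha * best + (1 - alpha) * sum(mm) / len(mm)


def phi3(z, y, alpha):
    """Score of labeling y minus the threshold."""
    return score(z, y) - threshold(z, alpha)


z = [0.5, -0.2, 0.1, 0.4, 1.0, 0.0, -1.0, 0.3]  # hypothetical scores h(x)
tau = threshold(z, alpha=0.5)
kept = [entry for entry, mm in zip(INDEX, max_marginals(z)) if mm >= tau]
print(tau)
print(kept)
```

With these scores the best labeling is (0, 0), and only the three clique assignments consistent with it have max-marginals at or above the threshold for $\alpha = 0.5$. Because the best score is never below the mean max-marginal, the best labeling always satisfies $\phi_3 \ge 0$ for $\alpha \in [0, 1]$.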
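Next, we show that $\phi_e$ is Lipschitz continuous with the same constant.

Lemma 7. Fix $c \in C$ and $y_c \in \mathcal{Y}_c$ and let $\phi_{[y_c]} : \mathbb{R}^m \mapsto \mathbb{R}$ be defined as

$$\phi_{[y_c]}(z) = \overbrace{\left( \max_{y' : y'_c = y_c} z^\top v(y') \right)}^{\theta^\star(x, y_c)} - \overbrace{\left( \alpha \max_{y'} z^\top v(y') + \frac{1 - \alpha}{m} \sum_{c \in C,\, y'_c \in \mathcal{Y}_c} \max_{y'' : y''_c = y'_c} z^\top v(y'') \right)}^{\tau(x; \theta, \alpha)}$$

where the over-braces show the relationship to the max-marginal of $y_c$ and the threshold, assuming
$z = h(x)$. Then $\phi_{[y_c]}(z_1) - \phi_{[y_c]}(z_2) \le \sqrt{2|C|}\, \|z_1 - z_2\|_2$ for any $z_1, z_2 \in \mathbb{R}^m$.

Proof. We once again apply the trick from the proof of Lemma 4. Let

$$v_1 = v(\arg\max_{y'} z_1^\top v(y')), \qquad v_2 = v(\arg\max_{y'} z_2^\top v(y')),$$
$$v_{1y_c} = v(\arg\max_{y' : y'_c = y_c} z_1^\top v(y')), \qquad v_{2y_c} = v(\arg\max_{y' : y'_c = y_c} z_2^\top v(y')),$$

with $v_{1y'_c}$ and $v_{2y'_c}$ defined analogously for each clique assignment $y'_c$ in the sum. Then, we have:

$$\phi_{[y_c]}(z_1) - \phi_{[y_c]}(z_2) = (z_1^\top v_{1y_c} - z_2^\top v_{2y_c}) + \alpha (z_2^\top v_2 - z_1^\top v_1) + \frac{1 - \alpha}{m} \sum_{c \in C,\, y'_c \in \mathcal{Y}_c} (z_2^\top v_{2y'_c} - z_1^\top v_{1y'_c})$$
$$\le (z_1 - z_2)^\top v_{1y_c} + \alpha (z_2 - z_1)^\top v_2 + \frac{1 - \alpha}{m} \sum_{c \in C,\, y'_c \in \mathcal{Y}_c} (z_2 - z_1)^\top v_{2y'_c}$$
$$= \frac{1}{m} \sum_{c \in C,\, y'_c \in \mathcal{Y}_c} (z_1 - z_2)^\top \left( v_{1y_c} - \alpha v_2 - (1 - \alpha) v_{2y'_c} \right)$$
$$\le \frac{1}{m} \sum_{c \in C,\, y'_c \in \mathcal{Y}_c} \sqrt{2|C|}\, \|z_1 - z_2\|_2 = \sqrt{2|C|}\, \|z_1 - z_2\|_2.$$

Here once again we have condensed the argument similar to Lemma 4. □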



Finally, we note that $\phi_e(z) = \frac{1}{m} \sum_{c \in C,\, y_c \in \mathcal{Y}_c} r_\gamma(-\phi_{[y_c]}(z))$. Therefore
$L(\phi_e) \le \frac{1}{m} \sum_{c \in C,\, y_c \in \mathcal{Y}_c} \sqrt{2|C|}/\gamma = \sqrt{2|C|}/\gamma$, thus completing the proof of Theorem 4.
Now turning back to Theorem 1, we note that the class of functions H we are working with is the direct sum
of m linear classes each bounded by norm B. Hence we complete the proof of Theorem 1 by using Theorem 3 with

$$R_n(\phi_f \circ H) = R_n(\phi_e \circ H) \le \frac{c\, m B \sqrt{|C|}}{\gamma \sqrt{n}}$$

for some constant c.
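For reference, the chain of inequalities behind this final bound can be written out explicitly. The sketch below combines (31), (32), and (33) with the Lipschitz constant from Theorem 4, taking expectations over samples in (32) and (33); here $c$ denotes the absolute constant from (31) and $c'$ is a name introduced only for this illustration.

```latex
\begin{align*}
R_n(\phi_f \circ H)
  &\le \frac{1}{c}\, G_n(\phi_f \circ H)
     && \text{by (31)} \\
  &\le \frac{2 L(\phi_f)}{c} \sum_{i=1}^{m} G_n(H_i)
     && \text{by (32)} \\
  &\le \frac{2}{c} \cdot \frac{\sqrt{2|C|}}{\gamma} \cdot m \cdot \frac{2B}{\sqrt{n}}
     && \text{by Theorem 4 and (33)} \\
  &= c'\, \frac{m B \sqrt{|C|}}{\gamma \sqrt{n}},
     && c' = \frac{4\sqrt{2}}{c}.
\end{align*}
```

The same steps apply to $\phi_e$, since Theorem 4 gives it the same Lipschitz constant.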

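A.2 Proof of Theorem 2

We define

$$\Phi_{joint}(y, h(x)) = \mathcal{L}_{joint}(x, y; \theta, \alpha) = 1\Big[ \sum_p \big( \theta_p(x, y) - \tau(x; \theta_p, \alpha) \big) \le 0 \Big],$$

$$\phi_{joint}(y, h(x)) = \mathcal{L}^\gamma_{joint}(x, y; \theta, \alpha) = r_\gamma\Big( \sum_p \big( \theta_p(x, y) - \tau(x; \theta_p, \alpha) \big) \Big).$$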
Once again, we fix any $y \in \mathcal{Y}$ and for each p, let (similar to Lemma 6)

$$\phi_3(z_p) = \overbrace{z_p^\top v(y)}^{\theta_p(x, y)} - \overbrace{\left( \alpha \max_{y'} z_p^\top v(y') + \frac{1 - \alpha}{m} \sum_{c \in C,\, y'_c \in \mathcal{Y}_c} \max_{y'' : y''_c = y'_c} z_p^\top v(y'') \right)}^{\tau(x; \theta_p, \alpha)}$$

where the over-braces show the relationship to the score of the correct label sequence under model
p and the threshold for model p, assuming $z_p = h_p(x)$ of model p.

Then $\phi_{joint}(y, h(x)) = r_\gamma\big(\sum_p \phi_3(z_p)\big)$ has Lipschitz constant at most $\sqrt{2|C|}\, P/\gamma$: we can apply
Lemma 6 for each p, and the bound then follows from the composition with $r_\gamma$ and the sum of P identical
terms. In Theorem 2, our function class H is the direct sum of $m \cdot P$ linear classes each bounded by norm
$B/P$, hence

$$R_n(\phi_{joint} \circ H) \le \frac{c\, m P B \sqrt{|C|}}{\gamma \sqrt{n}}$$

for some constant c.